Instruction Set Translation Method

1 Problem Description 

For a new processor architecture to be successful a large quantity of support tools and
software is required. Compilers must be made available that target the [...]. A debugger
is required to allow programs to be debugged while running on the architecture. Modern
debuggers need to support [...] development environment [...] compiler
and debugger tools into a powerful GUI based environment. If software engineers [...]
development environment with the tools they are used to using from day to
day then this represents a significant barrier to the adoption of a new architecture. The
development of such an environment and associated tools represents a very [...].

[...] retargeting of existing compilers and debuggers is particularly
problematic as the processor has no fixed instruction set. Since CriticalBlue does not
have a fixed instruction set it does not support any kind of assembly language. [...]
compiler and debug tools are designed to be targeted at a fixed architecture and need
extensive modification to cope with a variable architecture. An intermediate
representation is needed to hide the variability of the architecture from existing tools.

[...] CriticalBlue to minimise the amount of
development effort and to promote adoption of the processor. The trajectory [...]
existing, and therefore familiar, software development [...]. The intermediate
representation is in fact the machine code for an existing processor [...].

The translation must be able to take code generated for ARM and produce code for a
particular CriticalBlue processor. This code must faithfully reproduce the same results as
the original ARM code. However, the focus of the CriticalBlue architecture is to provide
superior performance on particularly key parts of application. Even though this application
code is expressed in ARM machine code the CriticalBlue code generator must be able to
reorder and schedule the individual operations as required to make effective use of the
innovative architectural features supported by CriticalBlue. Thus individual operations may
be scheduled in a completely different order to that of the original ARM code.

[...] environment must support the use of ARM debugging tools. In order to
[...] execution is equivalent to that of the
individual ARM instructions. The engineer must be able to use all the features of the
debugger. This includes the ability to set a breakpoint on any single ARM instruction
[...]. [...] reached [on a] particular instruction
[...] must correspond to the values that would be
expected if the code were running on a real ARM chip. The debugger may allow the
[...] at ARM instruction granularity. The effects of each
[...] their effect on register state and
[...] to be arbitrarily restarted from any
individual ARM instruction in the entire program.
[...] on a CriticalBlue processor [...] high levels of parallelism [...].

2 Prior Art 

[...] built that allow binaries written for one
architecture to be executed on [...] high
enough performance [...]
particular idiosyncrasies of [...] architecture on another significantly degrades
performance.

[...] the original architecture.
Unfortunately this method [is] very slow [...] the type of
embedded systems [...] CriticalBlue [...]
translated [...] to make use of [...].

[...] has been in the area of
[...] systems an increasing
amount of time is devoted to performing [...]
the cache if it is frequently [...]
optimisations [...] computationally [...]
very exact emulations [...] normally very difficult to
handle in translation. For instance [...] simply flushing
any affected code [...] precise exceptions (such as memory
accesses that generate [...]) [...]
dynamic translation systems [...].
Dynamic translation is less suitable for embedded [...]. [There]
is a significant memory overhead [...]
required in order to achieve good performance [...]
not provide sufficiently deterministic behaviour [...]
embedded real time environment [...]
the application is translated into the cache [...]
important block of code becomes evicted from the [...].

[...] translation techniques have
[...] number of [...]
significant to the applications [...]. [...] static techniques are
unable to deal with self modifying code [...] embedded environments.
Code is often stored in ROM that cannot be modified. [...] static translation
is unable to provide support for precise exceptions. [...] translation cannot produce
the exact results that would be expected [...] an exception. This
is significant if virtual memory is to be supported [...] memory access instruction
might cause an exception if there is a page [...] operating
system that brings in the required [...] resumes execution after the memory
access. Fortunately, CriticalBlue does not need [...] as the target
applications are considerably lower level [...] support. [...] static
translation techniques cannot handle [...] parts of a program.
They need to have full knowledge of [...]. Fortunately, such code
is not generated by compilers or [...]. CriticalBlue processors
are designed for programming from high level languages so this restriction does not pose
a particular problem.

[...] systems have not been
[...]. [...] imperative for
CriticalBlue for it to [...]
translation approach are primarily in the [...].

[...] codes are made available and are used to [...] instruction. The instruction
execution logic has to be integrated into the [...] processor in order to receive
operands and write results. [...]
the case of the [...]. [...] processors have their own
instruction set and tools, the [...] be updated so that the new instruction can be
accessed through the assembler using [...] mnemonic. However, the
compiler is not able to [make use] of the new [instruction]. The instruction is made
available from a high level language by means of [...] assembler [...]
that invokes the required instruction. Since [...] chain they
do not suffer from [...] CriticalBlue solution that uses a 3rd party,
fixed, instruction set.

3 Summary of Contribution 

[...] on a CriticalBlue processor. All
register reads and writes are [...] operations. Thus a 3-address add
operation is converted into operations to [read the] operand registers, the add
operation itself and finally an operation to [write the result to the register] file.
Instructions with complicated addressing modes or that modify the condition codes result
in longer sequences of basic operations.
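
The decomposition of a 3-address add described above can be sketched as follows. This is
a minimal illustration only: the `BasicOp` record, the `OpKind` names and the `translateAdd`
helper are hypothetical and are not taken from the CriticalBlue tools.

```cpp
#include <vector>

// Hypothetical basic-operation kinds; the names are illustrative only.
enum class OpKind { ReadReg, Add, WriteReg };

struct BasicOp {
    OpKind kind;
    int reg;  // register number for ReadReg/WriteReg, -1 otherwise
};

// Decompose "ADD rd, rn, rm" into explicit register-file accesses plus the add.
std::vector<BasicOp> translateAdd(int rd, int rn, int rm) {
    return {
        {OpKind::ReadReg, rn},   // read first operand register
        {OpKind::ReadReg, rm},   // read second operand register
        {OpKind::Add, -1},       // the add operation itself
        {OpKind::WriteReg, rd},  // write the result back to the register file
    };
}
```

For example, `translateAdd(0, 1, 2)` yields two register reads (r1 and r2), the add, and a
write to r0. An instruction with a complex addressing mode or a condition-code update
would expand to a longer sequence.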

A number of source instructions will generally be translated to form the contents of one
individual strand within a region. In most cases a strand will contain all the operations from
a basic block in the original code. The code will contain many reads and writes to the
register file. To achieve high performance on the CriticalBlue processor (since it has
limited register file connectivity) many of these accesses will be optimised away. For
instance, if an operation generates a result that is written to the register file and
subsequently read by a later operation then the code is optimised to pass the
data directly from the producer operation to the consumer. Register writes are preserved if they
are writing data that will be subsequently used in other strands or regions in the program.
Thus at the completion of strand execution the register state will be the same as on a real
ARM chip for all registers holding values that might be subsequently used. Registers
holding unneeded data may differ in value.
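
A rough sketch of this register access optimisation is given below. It assumes a
hypothetical `Op` record and a precomputed set of registers that are live on exit from the
strand; the real code generator is certainly more elaborate (for instance, this sketch keeps
every write to a live register, not just the last one).

```cpp
#include <set>
#include <vector>

// Hypothetical operation record; the field names are illustrative only.
struct Op {
    enum Kind { Read, Write, Compute } kind;
    int reg;  // register number for Read/Write, -1 for Compute
};

// Remove register-file traffic that is internal to one strand.
//  - A Read of a register already written earlier in the strand is dropped,
//    on the assumption that the producer's result is passed on directly.
//  - A Write is kept only if the register is needed by other strands or regions.
std::vector<Op> optimiseStrand(const std::vector<Op>& ops,
                               const std::set<int>& liveOut) {
    std::set<int> writtenInStrand;
    std::vector<Op> out;
    for (const Op& op : ops) {
        switch (op.kind) {
        case Op::Read:
            if (writtenInStrand.count(op.reg) == 0) out.push_back(op);
            break;
        case Op::Write:
            writtenInStrand.insert(op.reg);
            if (liveOut.count(op.reg) != 0) out.push_back(op);
            break;
        case Op::Compute:
            out.push_back(op);
            break;
        }
    }
    return out;
}
```

Registers outside `liveOut` are exactly those that "may differ in value" from a real ARM
chip at the end of the strand.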

Thus if the program execution were to be stopped on a breakpoint on the boundary
between strands then all the important register (and memory) state would be the same
as that observed on a real ARM processor. Of course, breakpoints can be set on any original
instruction. Reducing the size of a strand to a single source instruction would dramatically
reduce optimisation opportunities and thus the performance of the processor.



To overcome this problem, two versions of all regions generated for a CriticalBlue 
processor are produced. The main execution region is generated as discussed. A 
secondary, shadow, region is also produced. This contains the same translations for the 
same original instructions in each strand. However, in the shadow version, the individual
register read and write operations are not optimised away. Moreover, the operations for 
each original instruction are performed in their original order. Each block of operations 
corresponding to an original instruction is bounded by a breakpoint check operation. This 
takes the original address of the instruction and compares it against all current 
breakpoints. 

Breakpointing on an individual original instruction is achieved as follows. A breakpoint is
set by specifying an original program instruction on which to halt. This is converted, using
a table, into the address of a particular region and strand in the translated version of the
program. The specified strand contains the translated form of that particular instruction. A
breakpoint is set on the CriticalBlue architecture on that particular strand. Each region in
the main code contains a conditional branch to the corresponding shadow region. If the
breakpointed strand is about to be executed then that branch is taken to the same strand
in the shadow code. At the point of the branch the register and memory contents will
correspond to the state of a real ARM processor at the first instruction in the strand.
When the shadow code is entered it is executed until a match occurs between the original
PC value and the breakpoint unit. In the shadow code all the effects of the original
instruction are faithfully reproduced. When the actual breakpoint is reached the machine
state will represent that expected at the particular instruction. Thus execution proceeds at
full speed until the strand containing the breakpoint instruction is encountered. Execution
then proceeds more slowly on an instruction-by-instruction basis until the actual
breakpoint instruction is reached.
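
The two-level check described above (a per-strand test in the main code, then a
per-instruction test in the shadow code) might be modelled as in the sketch below. The
`BreakpointTable` class and its member names are hypothetical, and a software table stands
in for what the text calls the breakpoint unit.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <utility>

// Hypothetical identifier for a translated strand: (region, strand).
using StrandRef = std::pair<int, int>;

class BreakpointTable {
public:
    // Record which translated strand holds a given original instruction.
    void mapInstruction(std::uint32_t originalPc, StrandRef where) {
        location_[originalPc] = where;
    }

    // Set a breakpoint on an original instruction address.
    // Returns false if the address is not part of the translated program.
    bool setBreakpoint(std::uint32_t originalPc) {
        auto it = location_.find(originalPc);
        if (it == location_.end()) return false;
        breakpointPcs_.insert(originalPc);
        breakStrands_.insert(it->second);
        return true;
    }

    // Main code: should the branch to the shadow region be taken for this strand?
    bool mustEnterShadow(StrandRef strand) const {
        return breakStrands_.count(strand) != 0;
    }

    // Shadow code: check performed before each original instruction.
    bool isBreakpoint(std::uint32_t originalPc) const {
        return breakpointPcs_.count(originalPc) != 0;
    }

private:
    std::map<std::uint32_t, StrandRef> location_;
    std::set<std::uint32_t> breakpointPcs_;
    std::set<StrandRef> breakStrands_;
};
```

Only strands that contain a breakpoint pay the cost of instruction-by-instruction
execution; all other strands keep running the optimised main code.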

Single stepping is performed by executing the shadow code while breakpointing on every
original instruction. At the end of each region a branch is made back to the main code.
Thus when execution is restarted it initially executes the shadow code but a branch is
made back to the main code as soon as the region execution is complete. Execution can
be restarted at any arbitrary instruction in the program by executing from the appropriate
instruction in the shadow code.

The shadow code contains all the required operations to faithfully reproduce the behavior 
of the original ARM code. It is not heavily optimised like the main execution code. Thus 
the shadow code will be several times larger than the main execution code and, indeed, 
several times larger than the original ARM code. In an embedded system with limited
memory this poses a problem. There is not sufficient memory space to store both the
main execution code and the shadow code. The shadow code is only required when
debugging is being performed, to support breakpoints and single stepping. Thus during
normal usage of the application only the main code is required and the code size is
comparable to that of the original ARM version. 

When the application needs to be debugged the shadow code can be held in memory
connected externally to the system under debug. A slow bit serial interface can be used to
connect this external shadow memory to the CriticalBlue processor. Since a serial
interface is used the pin and silicon area overhead of the interface is very small. The
disadvantage of a serial interface is that the access speed to the memory will be
extremely slow. In fact the access speed could be two or three orders of magnitude
slower than that to the main memory of the CriticalBlue processor. However, the shadow
memory only needs to be accessed when a breakpoint is encountered or single stepping
is being performed. Moreover, only a small number of accesses need to be made during
these events. Thus the performance degradation due to the slow access speed of
shadow memory will not be noticeable and does not affect normal system performance.

Instruction set extension is supported by converting calls to particular software functions
into an invocation of an extension hardware unit. This is achieved as follows. The
engineer designs a hardware unit that performs the same operation as a particular
software function. That is, it takes the same parameters and produces the same results
as the code in its software equivalent. The advantage of the hardware version is that it will
be able to produce the results more quickly. The engineer adds the function to a list of
functions that are implemented in hardware. The hardware function is given the same name as the
equivalent software function. During translation, whenever a call to the software function
is encountered it is converted into an invocation of the hardware unit. The parameters that
would be passed to the software function are passed as the operands to the hardware
unit. The results that would have been returned by the software function are obtained as
the results from the hardware unit. In this way the effective instruction set of the
CriticalBlue processor can be extended as required. The hardware functions can be
accessed directly from a high level language just by calling the appropriate function.
Moreover, this can be achieved without having to modify or extend the fixed instruction
set of the source architecture.
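
The call conversion step can be pictured with the sketch below. It assumes the list of
hardware-implemented functions is available as a set of names; the `LoweredCall` record
and `lowerCall` helper are hypothetical and only illustrate the decision, not the actual
translator.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical result of translating one call site.
struct LoweredCall {
    bool usesHardwareUnit;      // true: invoke the extension hardware unit
    std::string target;         // function name (same name in both cases)
    std::vector<int> operands;  // parameters, passed through unchanged
};

// If the callee is on the hardware list, the call becomes a hardware
// invocation with the same parameters as operands; otherwise it stays an
// ordinary software call.
LoweredCall lowerCall(const std::string& callee,
                      const std::vector<int>& args,
                      const std::set<std::string>& hardwareFunctions) {
    const bool inHardware = hardwareFunctions.count(callee) != 0;
    return LoweredCall{inHardware, callee, args};
}
```

Because the decision is made per call site during translation, the source program and the
source architecture's instruction set stay unchanged.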



The purpose of the software/hardware interface is to allow engineers to specify the 
boundary between hardware and software in a system. Software languages do not 
normally have to provide any facility for specifying that boundary. It is an intrinsic 
assumption that the underlying processor hardware will support a set of basic operations
(such as addition, memory access etc.) that makes implementation of the software 
possible. All software is converted into a sequence of such operations by a compiler. 

The CriticalBlue processor also has intrinsic hardware support for such operations. The 
support covers the hardware units required for implementation of all the instructions for 
Any particular software function may be annotated to indicate that it is actually
implemented in hardware. Software functions are used as an abstraction for operations
actually implemented in hardware. Such functions must not have any side effects, such
as the alteration of global variables or other areas of memory, since such operations
[...] implementation in a hardware unit. Each function takes a number of
parameters and produces one or more results based directly on those input parameters.

A function is indicated as having a hardware implementation by simply using a special
[...] functions and [...] hardware unit. Thus hardware extensibility is
supported without the need to extend the software programming language. This
[...] assembler programming [...].

[...] generally wrapped in a functional interface. [...]
no requirement for an intermediate [...] interface.

If C++ is being used then CriticalBlue can make direct use of the class abstraction. A
class can be used to describe a particular hardware unit. The individual methods defined
in the class correspond to operations that the hardware unit can perform.

[Hardware functions do not need] to be purely computational in nature. For
instance functions can be written to read and write to an array. This corresponds to an
additional memory unit in the hardware of the processor. This is especially useful if extra
memory units need to be defined to improve overall memory bandwidth for signal
processing applications. Memory units can be defined of any size and width, allowing
arbitrarily complex memory architectures to be built directly from C/C++.

[...] high level languages to allow functions to be
defined that effectively extend the language directly. There is no need for the explicit
[...] typically characterized by architectures that need to
be extended using a register access model. In effect the user is defining the instruction
[...] to implement the instructions. Function [calls]
map directly to the usage of those instructions.

4.2 Execution Unit Model

[...] implement one or more individual functions. Each
of these functions is termed a method of the unit. This corresponds to the terminology of a
[...] language to program a
CriticalBlue processor then classes may be directly used to model a hardware unit with
the function members corresponding directly to these methods.

The diagram below shows the basic model of an execution unit 
The underlying model of an execution unit is as a synchronous, pipelined unit. This fits
well with the computational model of units within a processor. A unit is able to accept a
number of operands on a particular clock cycle and will produce a result a number of
clock cycles later. This delay is referred to as the latency of the unit. The latency of the
unit is represented by the vertical arrow on the left hand side of the diagram. It is
expected that the unit is able to accept a new operation on every clock cycle. If necessary
a blockage can be set for the unit that prevents it accepting another operation for a certain
number of clock cycles after the last one.
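
The timing rules in this paragraph can be captured in a small model. The sketch below is
an assumption-laden illustration: the class name `PipelinedUnitModel` and its interface are
hypothetical, and a blockage of 0 is taken to mean that a new operation may be accepted
on every clock cycle.

```cpp
// Hypothetical timing model of a synchronous, pipelined execution unit.
class PipelinedUnitModel {
public:
    PipelinedUnitModel(long latency, long blockage)
        : latency_(latency), blockage_(blockage) {}

    // True if the unit can accept a new operation on the given clock cycle.
    bool canIssue(long cycle) const {
        return !hasIssued_ || cycle - lastIssue_ > blockage_;
    }

    // Issue an operation; returns the cycle on which its result is produced,
    // or -1 if the unit is still blocked on this cycle.
    long issue(long cycle) {
        if (!canIssue(cycle)) return -1;
        hasIssued_ = true;
        lastIssue_ = cycle;
        return cycle + latency_;
    }

private:
    long latency_;
    long blockage_;
    bool hasIssued_ = false;
    long lastIssue_ = 0;
};
```

With a latency of 3 and a blockage of 2, an operation issued on cycle 10 produces its result
on cycle 13, and the unit refuses new operations on cycles 11 and 12.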

The operands upon which the unit operates are shown as the input arrows at the top of
the diagram. The types of these operands are completely user definable. They 
correspond to the parameters provided to the member functions defined by the unit. 

An additional operand is automatically generated. This is the method parameter that 
indicates which individual method should be executed by the unit. It selects from one of a 
number of different methods supported by the unit. Different methods may operate on 
different sets of operands. The input operands to the unit represent the union of all 
parameters of the different method functions supported. Only those pertinent to the 
method being used need be set. The CriticalBlue tools handle this automatically. 

The outputs from the execution unit are shown as the arrows at the bottom of the
diagram. These correspond to either the return result from the methods or reference
parameters. Reference parameters allow multiple results to be returned from a function.
Thus a single operation can generate multiple results using this model. For instance, a
division unit could return both the division result and the remainder as separate outputs.
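
The division example might be written as follows. This is a sketch only: the class name
`hw_Divider` and method name `hw_divmod` are hypothetical, chosen to follow the hw_
prefix convention used elsewhere in this document.

```cpp
// Hypothetical behavioural model of a division unit with two outputs.
class hw_Divider {
public:
    // Returns the quotient as the function result; the remainder is a second
    // output of the unit, delivered through a reference parameter.
    // Behaviour for divisor == 0 is left undefined in this sketch.
    int hw_divmod(int dividend, int divisor, int& remainder) {
        remainder = dividend % divisor;
        return dividend / divisor;
    }
};
```

A call such as `q = divider.hw_divmod(17, 5, r)` would then correspond to a single
operation on the unit that produces both outputs.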

Certain input operands and output results from an execution unit may be marked as being
"dedicated". That is, they are connected to certain ports on other particular hardware units
in the CriticalBlue processor. For instance, they might be connected to a control unit in the
processor or to other user defined execution units. This mechanism allows hardware units
to have an influence on the control flow of the processor. The mechanism may also be
used to create custom communication paths between hardware units, some of which may
be outside the CriticalBlue processor itself. Normally the input operands and output
[results of a] hardware unit are connected automatically as directed by the data flow in
example programs. Designated operands and results are only connected to the
specifically indicated ports on other hardware units.

[...] a synchronous pipelined unit with a fixed latency.
This is a good fit for most types of computational and logic units that will be used in
[the] processor. If required, asynchronous operation can be supported by [...].
[...] used to initiate an operation
[...] indicates when the operation has been
completed. A software loop can be used to poll the unit until the result is ready. [...]
the full result [...] then an interrupt may be used [...]
may [...] directly as an interrupt source for [...]
handler.

4.3 Parameter Widths

[...] to defining execution units that only operate on 32 bit (or perhaps
[...]) [...] units can be defined that operate on 4, 12 or 24
[...]. This [...] more efficient use of hardware
resources. If a 4 bit operand is used then only 4 bits are physically routed to the execution
[unit].

The width of individual parameters is specified as a suffix to the name of the function. Any
return parameter and then each of the function parameters has a number specified in the
[...] 32 and indicates the bit width of the
[...]. [...] parameters must be passed to the behavioral model it must mask
off higher bits not passed to the real hardware so that it generates the same result
[...] actual implementation.

If [...] parameters of larger than 32 bits to a hardware unit then
[...] multiple parameters of 32 bits or less. However the [...]
parameter may be encapsulated into a structure or class declaration [...]
parameter to a hardware implemented function. [...]
subdivision of longer parameters [...].
Conversely, such a function may also be used to assemble
multiple results into a single data item of more than 32 bits.
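
The masking requirement and the subdivision of longer parameters can be illustrated with
the helpers below. They are a sketch under stated assumptions: the names `maskToWidth`,
`split64` and `join64` are hypothetical, and the function-name suffix syntax for declaring
widths is deliberately not shown here.

```cpp
#include <cstdint>

// Keep only the low `bits` bits of a value, as a behavioural model must do
// for an operand that is physically narrower than 32 bits.
inline std::uint32_t maskToWidth(std::uint32_t value, unsigned bits) {
    if (bits >= 32) return value;  // avoid an undefined full-width shift
    return value & ((std::uint32_t{1} << bits) - 1u);
}

// A parameter wider than 32 bits, subdivided into two 32 bit operands.
struct Halves {
    std::uint32_t high;
    std::uint32_t low;
};

inline Halves split64(std::uint64_t v) {
    return {static_cast<std::uint32_t>(v >> 32),
            static_cast<std::uint32_t>(v & 0xFFFFFFFFu)};
}

// Reassemble two 32 bit results into a single wider data item.
inline std::uint64_t join64(Halves h) {
    return (static_cast<std::uint64_t>(h.high) << 32) | h.low;
}
```

For a 24 bit operand, `maskToWidth(x, 24)` discards the top 8 bits so that the behavioural
model sees the same value as the real hardware would.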



4.4 Behavioural Modeling

[...] implemented in a hardware unit forms a behavioral
model. That is, it describes the operation of the execution unit. The behavioral code is
[...] hardware would. Such code is
executed on the host system during simulation. The code may access I/O or [...]
execution units might [...].

The actual implementation of the hardware is generated separately from the behavioral
implementation. Any development methodology may be employed as long as the
behavioral model and hardware implementation remain equivalent. Normally, [...]
implementation is obtained by rewriting the software version into [...].

Behavioural models are implemented as untimed. That is, [...] behavioral model is called
synchronously and results are immediately [...]. [...] output method
is not called until at least a certain number of [...] input. Any
function that is declared [with] the prefix hw_ [...].

4.5 Use of Operator Overloading

[...] to perform such an [...].
[...] is made directly and the overhead of a [...].

[...] these additional memory units when they hold complex dynamic data [...].
4.6 Function Name Prefixes 




[...] not permitted, as they cannot be called [...].

4.7 Fixed Function Names

A number of fixed identifier names are used for functions that have specific meanings in
[...]. All fixed identifiers are [...].

4.7.1 Entry Point

[...] point [for] the processor upon initialisation.

Additional code is generated at the start of the entry function to set up the interrupt and
monitor vectors specified using other fixed functions. Fast interrupt handlers are also
locked down in the instruction buffer.

[...] likely to be partially implemented in assembler code for the host
[...] the initial value of the stack pointer. For
systems having a main CPU the entry point call will be initiated by a remote [...]
call from the main processor. This will pass information such as the location of the stack [...].

[...] automatically
placed at address 0 in the buffer. Execution commences from address 0 after reset. If a
[...] automatically into the instruction buffer
after [...].

4.7.2 Software Interrupt Handler

The identifier hw_swi is the entry point for a software interrupt handler. It is called if a
[...].

[...] In the handler certain registers are preserved using allocated
[...] processor, this implements the effect of the
banked registers made available when entering a software interrupt handler.

4.7.3 Hardware Interrupt Handlers

Interrupt handlers may also be declared as functions. These are defined as ordinary
[...] the processor. Extra code is inserted around the function to preserve
volatile registers in specially allocated registers within the register file.

[...] number of cycles from the point of
the request. Since the handler is called before the current region has finished executing
current values have to be held in shadow output registers from the functional units
[...] to support interrupts. Multi-cycle units
[...] pipeline [...] interrupt handler was
[...]. These may then be replayed on
return from the interrupt handler.
[...] of interrupts, fast and slow. The code for the
fast handler functions is loaded into the instruction buffer and locked down
permanently so there is no latency associated with loading the code from main memory.
[...] or other functions
[...] time dependent interaction with the hardware. Slow interrupts do
not have their handlers locked into the instruction buffer so there may be a delay while the
required code is loaded. Nested interrupts are not permitted.

The naming convention is to use an identifier of the form hw_fast_int_0 to
hw_fast_int_N where N is the maximum number for fast interrupts. Similarly slow
interrupts are denoted by the names hw_slow_int_0 to hw_slow_int_N where N is
the maximum number. If a particular handler function is not defined then interrupts from
the corresponding source are ignored.
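
As an illustration of this convention, the sketch below defines one fast interrupt handler
and a dispatch step that ignores sources with no handler. Only the handler name
`hw_fast_int_0` follows the document; the vector table, the `dispatchFast` helper and the
limit of four sources are assumptions made for the example, since in the real system the
vectors are set up by code generated at the start of the entry function.

```cpp
#include <array>
#include <cstddef>

using Handler = void (*)();

int fastInt0Count = 0;  // illustrative state touched by the handler

// Handler for fast interrupt source 0, named per the convention above.
void hw_fast_int_0() { ++fastInt0Count; }

// Hypothetical vector table: one slot per source, nullptr when no handler
// function with the corresponding name was defined.
constexpr std::size_t kFastSources = 4;  // assumed maximum, not from the source
std::array<Handler, kFastSources> fastVectors = {hw_fast_int_0, nullptr, nullptr, nullptr};

// Returns true if a handler ran, false if the interrupt was ignored.
bool dispatchFast(std::size_t source) {
    if (source >= fastVectors.size() || fastVectors[source] == nullptr) return false;
    fastVectors[source]();
    return true;
}
```

Unlike the hw_ functions that model hardware units, a handler such as `hw_fast_int_0` is
real code that runs on the processor.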

The code defined for the handler is executed when the corresponding interrupt is
received. Note that, unlike the code for hardware-implemented functions, it is not
behavioral code. It is actually executed in a real system.

4.7.4 Monitor Interrupt Handler

[...] processing interrupts generated
by the debug communications unit. The special identifier hw_monitor_int is used.

A monitor handler operates in exactly the same manner as a slow interrupt except that the
[...] monitor mode. This allows the handler to access [...] units
that are only accessible to the monitor such as the instruction predicate unit and the
[...] communications unit. [...] interrupt [is] generated when new data is received
[...] debug communications unit. This is a debug link with a host computer used
during [...].

4.7.5 Monitor Stop Handler

[...] executing shadow code and no destination
[...] a breakpoint [is] encountered. Shadow
code is executed up to the breakpoint instruction in order to create the required machine
state. The monitor stop handler is then entered. This can then communicate with the
debug host computer using the debug communications interface. The special identifier
hw_monitor_stop is used.

On entry to the function the volatile registers are saved to some special preservation
[...] registers used during an
interrupt since the monitor may be entered within an interrupt routine for debug purposes.

4.8 Example 

An example software definition of an execution unit is shown below:

    // A C++ class definition groups a number of functions into a single
    // hardware unit. The class is identified as a hardware unit by the hw_ prefix.
    class hw_24BitALU {
    public:
        // Method functions that should be performed directly in hardware. The
        // prefix of hw_ indicates the usage of hardware. Each function implements
        // a behavioural model for the corresponding hardware unit.
        int hw_add24(int x, int y) {
            return (x + y) & 0xFFFFFF;
        }

        int hw_sub24(int x, int y) {
            return (x - y) & 0xFFFFFF;
        }
    };

    // A hardware unit may be declared in the usual way.
    hw_24BitALU ALU;

    void function() {
        int a = 0, b = 0, c = 0;

        // Example usage of the functions. These are automatically converted into
        // usage of the hardware unit. C++ operator overloading may also be
        // applied to hardware function calls.
        a = ALU.hw_sub24(ALU.hw_add24(a, b), c);
    }

The example defines an execution unit for performing 24 bit addition and subtraction. The
functions are declared as public member functions of a class. The functions are declared
with the hw_ prefix that indicates that they are the implementation of a hardware unit.

The example illustrates the power of the methodology. Within a few lines of code the user 
has defined a custom instruction for a processor. There is no need to resort to assembly 
language or any complex definition language. What is more, the program is completely 
standard C/C++ and is easily readable by any programmer. 
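
The example's annotation notes that C++ operator overloading may also be applied to
hardware function calls. A minimal sketch of that idea follows, assuming a hypothetical
`Int24` wrapper type that is not part of the original example; the ALU class is repeated so
that the fragment is self-contained.

```cpp
// Behavioural model of the 24 bit ALU from the example above.
class hw_24BitALU {
public:
    int hw_add24(int x, int y) { return (x + y) & 0xFFFFFF; }
    int hw_sub24(int x, int y) { return (x - y) & 0xFFFFFF; }
};

hw_24BitALU ALU;

// Hypothetical wrapper type for 24 bit values.
struct Int24 {
    int value;
};

// The overloaded operators forward to the hardware functions, so ordinary
// arithmetic syntax would map onto the custom unit.
inline Int24 operator+(Int24 a, Int24 b) { return {ALU.hw_add24(a.value, b.value)}; }
inline Int24 operator-(Int24 a, Int24 b) { return {ALU.hw_sub24(a.value, b.value)}; }
```

With these definitions an expression such as `(a + b) - c` on `Int24` values expands to the
same `hw_sub24(hw_add24(a, b), c)` call sequence as the explicit form in the example.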



5 Translation Principles 



5.1 Architecture Choice 

Third party architectures are chosen that are already popular in the application space
being targeted by the CriticalBlue product. Thus the development tools and instruction set
will already be familiar to the target engineering groups for the product, lowering the
barrier to adoption. The initial intermediate representation to be used will be ARM code.
Later versions are likely to support MIPS and PowerPC, other architectures popular in
System-On-Chip design.

The translation must be able to take code generated for ARM and produce code for a 
particular CriticalBlue processor. This code must faithfully reproduce the same results as 
the original ARM code. However, the focus of the CriticalBlue architecture is to provide
superior performance on particularly key parts of application. Even though this application
code is expressed in ARM machine code the CriticalBlue code generator must be able to
reorder and schedule the individual operations as required to make effective use of the
innovative architectural features supported by CriticalBlue. Thus individual operations may
be scheduled in a completely different order to that of the original ARM code.



5.2 Translation Technique

CriticalBlue employs a static translation technique. Static translation techniques have a
number of limitations. Fortunately these limitations are not of particular significance to the
[...] embedded environments. Code is often
stored in ROM that cannot be modified. [...]
provide support for precise exceptions. [...] translation cannot produce the exact
results that would be expected [...]. This is
significant if virtual memory is to be supported [...] memory access instruction might
cause an exception if there is a page [...]
that brings in the required [...] memory access.
Fortunately, CriticalBlue does not [...]
applications are considerably lower level than those [...] support. [...] static
translation techniques cannot handle [...] parts of a program.
They need to have full knowledge of [...]. Fortunately, such code
is not generated by compilers or [...]. CriticalBlue processors
are designed for programming from high level languages so this restriction does not pose
a particular problem.

ablet ^^o^diLg^^^^ ~ have ^t been 

CriticalBlue technology incorporates a number of innovations to overcome this restriction
and allow use of existing debuggers.

5.3 Translation Operation 

I^qurnrrmoS^Src^era^^^^^ -^^.ARM instnrction into a 

reglsteiB reads and vwftesTr^ fompd int^ . °" ^ CriticalBlue processor. All 

operation is «)nverte^hto opSra^^s lo^"^^^^ "^"^"^ ^"^"^'^^ ^^d 

operation itself and finaHy S^peiton ?o 1^1 "^^^^ ^P^'^"^ '^9'^*®'^' "^e add 

instructions With complicJL'Sdd°S^^^^^^ file, 

in longer sequences of basic operations ^ condition codes result 

SdSaTstlTS:!^^^^^^ *° contents of one 

a basic block in the oS S.de ScJ^if ^"^ ^" °P^^O"s fro"i 

register file. To achieve htah tirfom^^n^ nn^^^ T^^^^ "^^^^ writes to the 
limited register file cSnneSMtvf^^^Ti t^^^ CnticalBlue processor (since it has 
instance, if an opeSofSrates ^a re!Sf th«5^'^^^^^^ ^ ^^^y- 
subsequently readV use IwTSbinLnf f 's wntten to the register file and 
data from th'e produir Operation t^^^^^^ °P«^'^^ to write the 

are writing data tiiat will be s^bs^lnt^ ?^^^^^^ presen/ed if ttiey 

Thus at the completi^ erf siSndSSbn 1 w"''^ °' "^9'°"^ ^^^^ P^^^m 
ARM chip for a\\ registers hoWim^.p« ^o^^^^^ 1?^ ^^""^ ^s onl real 

holding unneeded date LydSv^L^^^^ subsequently used. Registere 

J^;:^een^t3s'rnISf™ ^ %^--kpolnt on ti.e bounda^ 

tiiat obsen^ed on a marARM 3XL.:i'X''r "memory) state would be tfie same as 
InstruoUon. Reducing meS^^Sd^T^^^l can te set on any ortglnal 

reduce c«on?pport^^^:r«:^^^^rc,r^r"'^ 
5.4 Supporting Instruction Granularity Breakpoints

^ndary.shad^Snb^l^n^^'^ '^^^ ^"^^ ^ A 
corresponding to an origln^liSTn Is tound^^^^^^ ^"^^L °f operations

takes the original adS Tfte irSt^Son ^ '''^'^P°'"t °Peratlon. TTils 
breakpoints. ® instruction and compares it against all current 

progmm. The specified stSnd'SSns iSZsS^T^^^^^ °^ 
breakpoint is set on the OiticalBlue arcSteS^ nrfl?. 2 ^'f * particular instmction. A 

• the main code contains a^Snaf^raS^^^^ Z ^^^^ "^^O" 

break pointed strand is a?oS to^e SeS^^^ ^'on- if the 

in the shadow code. At the point <5Te S^S tX?lnTi^ *° '^"'^ 

conespond to that state of Tr^l ARM nSn2 f ?u^*®i^ ^""^ "'"tents will 

When the shadow c^e fe^teS ft^xeSS f.l^ ^% the strand. 

PC value and the breakoSnf ^it fn ^ "^'^ "^t^een the original 

state will present tfStex^Sd^ftr?«rfSL^^ '^^^^^^ "^a<Sine 

full speed'until the stXrSinTng LTeatooin^^^ 

then proceeds more slowly on an Ins^cfcn hS ^"^""tered. Execution 

point instructfon Is reached '"struction-by-instruction basis until the actual break 

b^etya^^^^^^^ ^"e expliCy testing for . 

to the main code. Thus >^er^Son1I ^.t^^^ .f^'^" ^ '"^"^^ '"^ "^d® back 
but a branch is made back to tS .SZ^^^h "^^^"^^ •* '"^ally executes the shadow code 
Execution can bTresteSi^t^v "^'on execution is complete. 

theappropriateSiSorTfnSe'h'aS^^^ P"^'^'" executlng'from 

5.5 Instruction Set Extension 

intra? rnv^rTL.%SsT^ ""^ ^ P^^-'- ^notions 

engineer designs a hartviSre ur^S hS^^^^^^^ unit Th.s is achieved as follows. TTie 

software funct-on ^at^TSkes the ?amf Zml' '^"J ^ ^ P^'^'^"'^'" 

as the code in its software iSv?ert T?r«H^?n? ^"^^ ^"^^ f^sults 

be able to produce!^ res^fe ^^ a^^ ^IJ^"^'^ it will 

are implemented in haSre ^7 SS^r. filn^-^'"^-^"" ^"^^ *° ^ «iat 

equivalent software fUnSn Du]3ia fr^nSn ^® ^^'"e "^"^^ the 

is encountered it is cSrtS^lfnran hvcSS *° 
would be passed to the s^S ft jnS^ l ^^^^ hardware unit. The parameters that 

unit. The rLulte that wLwS?lSe?SumL^^ ^^I^ *° ^ ^"^'^^^ 

the results from the hartvS-e u^rin^^^^ 

CriticalBlue processor ^nTe eSdlS ?i rT.^ 5^^^^?'^^ instmcaon set of the 
accessed directly from a hiah 2J«M«nn.f ^® ^"'^'^ can be 

Moreover. thKn a^ie?! S^'^^^^^^ Wopriate function. 

setofthesourcTarcM^re *° "^""^^ ^^"'^ ^^^d instruction 

5.6 Translation Operation 
5.6.1 Philosophy 
Sif of°MSa^pef °" a «=ript can be used that incorporates both the Onk and the

Since MetaMapper is run after the linker it operates on a complete executable. There are no
unresolved references and the locations for all data sections are determined. Dynamic
linking is not supported, although such support is unusual in embedded development
environments anyway.

The executable image provided to MetaMapper should not be stripped. All internal
symbols should be retained as they are used to help locate the start of separate functions.
Translation of stripped executables is possible but the effectiveness of
the translation caching system is substantially reduced.

SntS'thf .;i,n 1^1?^ executable image and generates a new executable image that 
contains the translated code. This process never modifies the size of the file containina 

rvmhni t^hf ."^^t^Mapper does not have to recreate the section structure of the file. The 

Xblte tL/Jr°'''"fl^y Pr^^- ^y^'^^'^ -^presenting entry poinfe 

will tocate the address of the onginal code entry point. Since exactly the same format fe 

los^u^pT^lUi^rdebr^^^^ '""^^ - ^-^"93- - orcS 

L^itefli^'^-®'^ CriticalBlue processor with an uncached instruction buffer 
then a separate buffer image file is produced. This is a direct binary reoresentation of th^ 
^de that should be loaded into the instruction buffer on power^up TS^S al Si 

L^StTsandraims. "^"'^ ^'"^ ^^^^ ^ ^^^^ '^^^ 

5.6.2 Code/Data Differentiation

The MetaMapper must be able to differentiate between code and data in the image All 
cn?. TnH H^t ^' ^ translated. In some imagi folate a^ 

m«rJ.H f -f^ ^ ^P^^*^ executable file. These sectio,^^ 

IT H f P'"°P"^*® to allow code and data to be easily differentiated. 

SJid^oH o ^- -^PPf to different base locations in the address map. Nearly J 
embedded environments allow writable data to be held in a separate section flx)rn code 

^^tld^. ~" '"^"'y Obviously ^ITsup^rt 

th?rL?!^f°'''"®"* environments mix code and read-only data in the same section. In 
words'^^i IpTr^?* ^ mechanism must be available to differentiate the two types of . 
noJ^fL? ARM development environment mixes code and iBad-only data. However it 

Ht?^?.®"^"!!' "i^® ^'^'^ s^°^"9 the transitions between code a^d 

data. These markers are read by MetaMapper to enable the translation process.

5.6.3 Translated Image Areas 

5.6.3.1 Main Translated Code 

fTIU^Jf T'"" °^ ^® executable that contained the original code and data All 

T '!'^;"?'"®d without any modification. Address links are inserted into 
the code at required points to support indirect branches.

If code is being generated for a processor with a cached instruction buffer then the
original executable code is overwritten by translated code. There is no requirement for
any correspondence between the original code and the translated code it is overwritten
with. The code does not have to belong to the same function.

5.6.3.2 Original Code Copy 

A separate section of the executable is used to hold a copy of the original code image.
This is not used as part of execution and only needs to be present when debugging is
being performed. It may be held in shadow memory. Only the monitor program accesses
this data. This is done when a request is made to read the main memory. Equivalent data
items from this section are returned rather than the actual content of main memory to give
the impression that the original code image is being executed. Thus disassemblies of the
code in the image will show the original code.

The code copy is structured so that only locations that have been modified need to be
stored. If a particular word is unmodified in the translated code image then data is
obtained from the main memory image. This prevents obfuscation of bugs involving the
illegal writing of read-only data. If code is being generated for a processor with an
uncached instruction buffer then only the words overwritten by address links need to be
stored in this section.
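
The read path described above can be sketched as follows. This is a minimal illustration, assuming the original code copy is held as a sparse map from address to original word; the names `read_word_for_debugger`, `main_memory` and `original_copy`, and the example addresses and values, are hypothetical.

```python
def read_word_for_debugger(address, main_memory, original_copy):
    """Return the word a debugger should see at `address`.

    original_copy holds only the words that translation modified;
    every other word is read straight from main memory.
    """
    if address in original_copy:
        return original_copy[address]
    return main_memory[address]


# Hypothetical image: the word at 0x8004 was overwritten by translated code.
main_memory = {0x8000: 0xE1A00000, 0x8004: 0xDEADBEEF, 0x8008: 0x00000042}
original_copy = {0x8004: 0xE0801001}

view = [read_word_for_debugger(a, main_memory, original_copy)
        for a in (0x8000, 0x8004, 0x8008)]
```

Because unmodified words are always fetched from main memory, an illegal write to read-only data stays visible to the debugger rather than being masked by a stale copy.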

5.6.3.3 Shadow Code 

This section is used to hold translated shadow code and is only present for processors
using a cached instruction buffer. If the instruction buffer is uncached then the shadow
code is generated as part of the binary image to be loaded into the instruction buffer at
power up.

Shadow code is an unoptimised version of the main code that is present to allow single
stepping at original instruction granularity. In general the shadow code is several times
larger in size than the main code. The shadow code only needs to be present when
debugging and can be located in shadow memory.

5.6.3.4 Mapping Table 

The mapping table is used to translate original code addresses to the equivalent position
in the translated code. The monitor program uses this table if an update to the PC register
is performed. This may occur if the user utilises the debugger to jump to a particular
function in the program. The monitor must determine which location, in the translated
image, it needs to branch to in order to create the equivalent behaviour. The table only
covers executable addresses. Attempts to load the PC register with values outside of the
executable portion of the image are ignored.

The mapping table may either specify an address in the main code or an address in the
shadow code. If the address is in the shadow code then the monitor must use the
instruction predicate unit to only commence execution once the required instruction has
been reached. The breakpoint mechanism is used as part of this process.

The table is structured to make use of the linear nature of translated shadow code.
Individual instructions can be located using the instruction predicate unit. Within shadow
code, ascending blocks of instructions are allocated to ascending strands. Thus within a
region the table only needs to list the number of consecutive instructions within each
strand.
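
A lookup over such a table might look like the sketch below. It assumes fixed 4-byte original instructions and a per-region list of instruction counts, one per strand; the function name `locate_in_shadow_region` and the example values are hypothetical and not part of the described design.

```python
def locate_in_shadow_region(original_address, region_base, strand_lengths,
                            instr_size=4):
    """Map an original code address to (strand index, instruction offset).

    strand_lengths lists the number of consecutive original instructions
    held by each strand of the region, in ascending order.
    Returns None when the address is misaligned or outside the region.
    """
    offset = original_address - region_base
    if offset < 0 or offset % instr_size != 0:
        return None
    index = offset // instr_size
    for strand, length in enumerate(strand_lengths):
        if index < length:
            return strand, index
        index -= length
    return None


# Hypothetical region starting at 0x8000 with strands of 3, 5 and 2 instructions.
position = locate_in_shadow_region(0x8010, 0x8000, [3, 5, 2])
```

The returned offset within the strand is the kind of value a monitor could use, together with the instruction predicate unit, to begin execution at the required instruction.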

The mapping table only needs to be present when debugging and can be located in 
shadow memory. 
5.6.4 Image Padding

Since MetaMapper does not modify the size of the executable, the original executable
must be large enough to hold the translated code and all the other image areas needed.
For an ordinary executable image this will generally not be the case since the shadow
code is several times the size of the original code.

To provide an executable of sufficient size it is generally padded prior to the link stage.

f^T^^^i ' ^ P'^^S"^ fi"^ ^ """^^ data In oX o pad Se frSge 

to a sufficieritly large size. Ranges of padding files are provided of differirTsizS If the 
image IS not big enough then MetaMapper reports that as an enTand suSSts 

r^SSt^nZtr- " ir^^ ^'^""'^y too SSdSg 

nf SS^ ^ MetaMapper is able to inform the user so that they can optimise ttie size 

tlTtifSS"^^'^ '""^If P^^^'"9 ^ "^^^ with speSTsymbols so 

thm MetaMapper is able to identify them as padding space rather than requi Jd «Sl or 

The padding for areas that may be located in shadow memory is provided as a section
2dow'a,?a "'""^^^ "^^^ *° ^se TZ 

5.6.5 Wide Region to Word Region Mapping

The code that is to be executed on a CriticalBlue processor is formed into regions,
contiguous blocks of execution words. The width of these words is
definable as the CriticalBlue architecture is able to support many different word widths.



l!tn,2'Sin"m'^''*'°" '® '"clJvWual regions must be 

Sci^ dom^nHT." the processor. Regions are loaded into L instruction 

t^ftinT ^^'"^"^.^"""S execution. The width of the main memory is fixed at 32 bits so 

l^m'^nrmo,;?'^" ^ "^"^'^^ ^ ^ ^ ^ ^^rS 

This process is illustrated in the diagram below:
[Figure: execution words of a region mapped onto memory words. Labels: Region Control
Word; Execution Word Width (e.g. 48 bits); Unused Bits; Word Width (32 bits).]


rBp^entatlon. lndlwuTl2 b t ^^^^^ ^'^ original 

reassembled into its^dXr^i. ^^'^ '^'"^'^ *® ^^*~°«0" buffer ft is 

As illustrated there may be some unused bits witfiin ttie fa-^t xA/nrw «f 
representation. An addftional reoion confrm «J>r<rio ^ f ^ °^ namower 
representation. This conte^s inSo^JSutX^^^^ ^'^^ 
memoiy. TTie individual worels mav S ..h??™^^ '® ^eld in 

control v«o«J also indicSteTml S|f^„t?iZ -^^^^ T'''^ "'^"^o'V- "n^e region 
.ogictodetem.ine;^;ra^;:,S?^^^^^^^ b"«^r fill 
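
The splitting of wide execution words across 32-bit memory words, and their reassembly on transfer to the instruction buffer, can be sketched as follows. This is an illustration only: the bit ordering (first execution word in the least significant bits) and the names `pack_region` and `unpack_region` are assumptions, not details taken from the CriticalBlue design.

```python
def pack_region(execution_words, width, memory_width=32):
    """Pack `width`-bit execution words into `memory_width`-bit memory words."""
    bits = 0    # accumulated bit stream, first execution word in the low bits
    nbits = 0
    for word in execution_words:
        bits |= (word & ((1 << width) - 1)) << nbits
        nbits += width
    count = -(-nbits // memory_width)   # ceiling division
    mask = (1 << memory_width) - 1
    return [(bits >> (i * memory_width)) & mask for i in range(count)]


def unpack_region(memory_words, width, num_words, memory_width=32):
    """Reassemble `num_words` execution words from the packed memory words."""
    bits = 0
    for i, word in enumerate(memory_words):
        bits |= word << (i * memory_width)
    mask = (1 << width) - 1
    return [(bits >> (i * width)) & mask for i in range(num_words)]


# Three hypothetical 48-bit execution words occupy 144 bits, so five
# 32-bit memory words are needed and 16 bits of the last one are unused.
words = [0x123456789ABC, 0xFFFFFFFFFFFF, 0x000000000001]
packed = pack_region(words, 48)
restored = unpack_region(packed, 48, len(words))
```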

5.6.6 Word Region Address Allocation 

s^rn^^sl Slfds^'fn'gS?^^^ ^^'^ ^ "^n^S-- 

of vvortis in main memo^: Ko^^vef iJ^^erTr^^^^ '° a contiguous sequence 

translated image it must be oosSe to mfJ °^ ^^ds in tlie 

become highly' fragmented dSfto the P^eni S .f P"^. "^^ '"'"^S^ ^" 

intemiixed with code. presence of address links and read-only data 

The following diagram illustrates an example region allocated to part of the image: 
[Figure: Allocation of Region in Image. Labels: Region Control Word; Address Links;
Data Section.]


The mappings from the region words to locations in the address map are shown. The 
address map is fragmented with two address links and a block of three data words. The 
region words are flowed around these restrictions in order to make highly effective use of 
the space. 

The region control word fields show the layout of the words in memory. The first field
holds 2 and indicates the number of contiguous words before the first obstruction. The
second field holds 1 and shows the number of words that must be skipped over. During
the main memory to instruction buffer transfer this value is used to update the address
pointer to skip over the required number of words. The third field holds 4 and shows the
length of the next block of contiguous words from the region. The remaining fields
continue to alternate between included and excluded blocks of words. If any field has a
value of 0 then that indicates the end of the region has been reached.
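
The alternating include/skip interpretation of the fields can be sketched as below. The first three field values (2, 1, 4) follow the example in the text; the remaining fields, the start address, the 4-byte word size and the function name `region_word_addresses` are hypothetical.

```python
def region_word_addresses(start_address, fields, word_size=4):
    """Return the main-memory addresses of the words belonging to a region.

    Fields alternate between a count of contiguous region words and a
    count of words to skip; a field of 0 marks the end of the region.
    """
    addresses = []
    address = start_address
    include = True
    for count in fields:
        if count == 0:
            break
        if include:
            for _ in range(count):
                addresses.append(address)
                address += word_size
        else:
            address += count * word_size   # step over links or data
        include = not include
    return addresses


# 2 region words, skip 1, 4 region words, skip 3, 1 region word, end.
layout = region_word_addresses(0x1000, [2, 1, 4, 3, 1, 0])
```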



5.6.7 Executable Map Allocation 

The translated code, in the form of regions, must be allocated to the memory map.
Address links, data sections and original code that is not being translated need to be
preserved.



The following diagram illustrates an example allocation: 
muM ateo remain at the same loS<^ Sste.JJS^^)' ""T^ ^

5.7 Address Linking

5.7.1 Function Call Address Link 

'^'^'^nT^Z^^^^^^I^^r'^'^^ .ha. wou,. ^ 
normally possible because in gSS fr,! Siin!. ? ^^^^^ translated version. This is not 
being called are located at a SiffeSd^L^^^e ^^^^^ "^"^^on 
IS loaded into the link register bTa iinnS^n!? f?^^^'^^''. "^^9®- "^^"^ address 
is architecturally visible The Hnk S^iJter^s nm.^^^^^ ''"^Q^ this value 

function makes any further Jb I^^uS.b^^Z L'^ ^""^ ^ ^ ^"ee 

to generate a stack trace back anTlSn f ,? P^^erved link values In order 
Thus to maintain compatbl.«y^J^^b^^°^ ^ MSruSLe^-usS. 
The function call address link mechanism is illustrated below: 
[Figure: function call address link in the Translated Code Image. Labels: Translated
Callee Function; callee; load link; return; load original follow instr address into link
register; call callee.]


execution returns anerL call nSet^nste^^^^ ^^^"-^^^ ^ ^"'ch 

. .nstrucuon is used to hold an a(^Sr^sS 2^ ad^?.«<^^-T^ location of the following 
pointing to the address of a regfonTcTnsfet^ r?f n L""" ^"^^^ ^ ^2 bit data word 
strand to execute from that ZgZ In ftfe Sse S?^' """^^^^ °f the fir^t 

oiiowing the call site in the translated cSe JSdri^^rT ""\P°^"ts to the operation 
.nstruction address. Translated (Sde i?L2 a^^^^^^^^ ^. ^ *^e follow 

translated version of the caliinn xf^^ "° ^^e address link as required The 

Of the follow insSu%!^^^^^^Z^S'^ ""^ -agister with the addrSs 

return InstrucHon (essenlSran ti^r^S^^^^^^ ^'^^ lunction thi 

indirect memory load andS^ anhdSS ^hSlin' ^"^^^^^ ^ an 

JnK and .en branches to that regior^n^? '^n'ZT L^sl^S 

mafnS:fd'::^h"de^^^^^^^^ ^Se^°^^ ^ ^" ~i,ity 

explicitly load the link register w^ ar^ ^^^T' "^^^ ^® requirement t^ 

indirect load at the point of^XtJSreSiriS^ ^ "^"^'^^^ ^^^"^ ^ ^" extra 

5.7.2 Indirect Call Address Link

o'^gVaT^^uL^S,^^^^^^^^^^ calls to be made using «.e 

supported in most high tevef Snqiqi l^cU^^^^^ ^"^ e>^P«c™y 

implemented as indirect functionr^f ^fi?" r 2 ^" ^"^"^l functions are effectively 
generated virtual tS:'! S^^^aUs ^"^^^^^^^^ '^^'^l" ^ ^P^^'^l compiS 

functions. associated with the class object using the virtual 

The diagram below illustrates how the mechanism works:
[Figure: indirect call address link. Labels: Original Code Image; Translated Code Image;
original callee address maintained; callee.]


A data table somewhere in the code image provides the address of the function callee. At

SL^!,^ °? u '^-S"®* ° ^" '"^"■^'^^ ^" "Sing data that has been 

loaded from the table. The function address does not have to be loaded from the table
directly before the indirect call. It may be loaded into a data structure some time before.
Thus in general it is not possible to determine what set of functions any given indirect call

Zf^ "^.t- 7^ ^"^'y^'^ '^"^^ ^^^"'^^ that any indirect (ill^n reach 
addressed function anywhere in the code image.

lr«mri!!!?'l!^ "^f^ 'T^t ^^^'■^^ ^ata item is maintained in exactly the 

same state. It points to the function address in the original code. At the original function

fori^S Sl?'':^- ^ '"i^^? "^"^ P'^^^- ™^ «^"t^'"s the address of S t^ansS 
IT J^® translated code must flow around this data item at a fixld 

nf2?™.d^f S;'' 1^" ^"^"^^ ^" ^ 'ndJeS toad is 

fn fhi^ ^® "^"tents of the address link. An indirect call is then made 

to the destination pointed to by the link. Thus all indirect function calls are made doubly
indirect in order to reach the translated form of the function.
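
The double indirection can be modelled with the small sketch below. Memory is represented as a dictionary and translated functions as Python callables; the addresses, the table layout and the name `indirect_call` are hypothetical stand-ins used only to show the extra load through the address link.

```python
def indirect_call(function_pointer, memory, translated_functions, *args):
    """Call through an original function address via its address link."""
    # Extra load: the word at the original address is the address link,
    # which holds the location of the translated function.
    translated_address = memory[function_pointer]
    return translated_functions[translated_address](*args)


ORIGINAL_CALLEE = 0x8100      # function address as seen by the original code
TRANSLATED_CALLEE = 0x20000   # where the translated form actually lives

memory = {
    0x9000: ORIGINAL_CALLEE,            # data table entry, left unchanged
    ORIGINAL_CALLEE: TRANSLATED_CALLEE  # address link at the original entry
}
translated_functions = {TRANSLATED_CALLEE: lambda x: x + 1}

pointer = memory[0x9000]      # program loads the pointer as it always did
result = indirect_call(pointer, memory, translated_functions, 41)
```

The data item seen by the program keeps its original value, so pointer comparisons and debugger views are unaffected; only the call itself pays for one additional load.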

In general the code image is analysed and address links only placed at the start of 

E°rn Thh "^'"S addressed. However, a more Lplfetic ^alj^fe can 

h?v? r ? if h'T" ? ?! ^^"^ °^ ^" "^'^ "^^^"s that the analysis does ?ot 

have to reliably detect all data items representing a function address. However, if this
policy is utilised then it must be possible to detect the start address of all functions by
other means (perhaps by assuming they always have a symbol and are preceded by return
6 Function List Creation

6.1 Requirement

aXe r SnvTSTn^^^^^ fT^^^^- ^ code v^hln 

small granularity to allow thrp^eiStiS^ t^^^sSS ^ ^^'^ ^ 'datively 

MetaMapper to the next The p^S^So Sn^S?^^^^ «ie 
any size will be significant This de^S« .w a" re executable image of 

and debug cycle of so^ re^L^^S SmS?^f °°^P"- 
that only modules In which code hp^^Trn;]^ compiler tools use a makefile system so 

oene^tedf^rotherm^SS^srsLllyta^^^ 

Sv^^^iainrthe »S r. ^el»^^ ^ ^^'^^^ ^n<*-ons «.at 
functions are changed on each deveSmem^^^^^ °' ^ ""'""er of 

6.2 Reference Counting 

mfhlra ^untT«?elmT^^^^^^ ^ ^ -^rence count table 

of the code. The teSe Is S by frL^^rli^^^^^^^^ to from other parS 

map Is created with one entry for each v!i?dt, ? -^"^'^ executable. A linear 

reference-count of the number of tac^ftiat^L^I^^^^^^ "^-^T ^^""^ ^^"^ ^^'^^ ^ 
S»on"" ^ "-^-"^ -"-SnM^^^^^ 

dls^nat^ Vln?^^^^^^^ reference count tor «s 

count of its destination is LXd U? mS^iLm ^n^^^^ ^ "^^^^^"^ 

that are actually called defin.e.y fom, L^S^oTS^^^^ 

start of a lunction In flie ar^llS^I^ I maximum count to ensure that it forms the 
development tools. Si Tddress^ 5 W«onf ^ecutables by the ARM software 
•mis allows such addtslgTS SiaCilcSd. " ^"^^ 

uSU'iT/h^^^^^^ -tained fixup lnfom,atlon. 

final executable A funcfon ac^dr^ha hi S! k ^''T '."formation to be retained in the 
Of a code rather than Sate ^cA?^^^^ *® charactenstic of being relative to the base 

J^lrro^onTs'to^sr^e '^^"9 ftinc«on add^ssing a 

addresses with an assS sj^boi ^ n mallS^iT^'^^^^,'^*^'"^ ^" code 
analysis must be careful to disced ^^iSL ^iT f^ ^ <^"ction. This 

labels within functions. "'"'P"®' ^^^"^^ symbols that represent internal 

6.3 Used Word Map

^dtt Se^f-^allt^"* r ™^ ^ ^ 

wo^rsKrti-Se£^ 

.elng used. Some -Kt^^J L^^^^^^^ 
immediate parameters) also have to be preserved in the executable image. They are also
marked as being used.

6.4 Function List Creation

The function list is created using the reference count table. Entries that are marked with
the maximum count definitely represent the start of functions. The function creation works
by initially assuming that only entries marked with the maximum count are function entries
and gradually splitting functions into smaller blocks if that assumption proves to be wrong.

A candidate function is delimited by two maximum count entries in the table. If all the
control flow within the function is internal then the candidate function can be added to the
function list. The code in the candidate function is traversed and this time branch
destination reference counts are reduced. If a branch is encountered and the destination
is within the candidate function then the reference count for the destination is reduced by
one. If, at the end of the candidate function traversal, all the reference counts within the
function are zero then all control flow is internal and there are no branches into the middle
of the function. If there are non-zero counts then there is complex control flow resulting in
such branches. In that case the candidate function is split into a number of smaller
functions partitioned around the non-zero counts.
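
The splitting step can be sketched as follows. This is a simplified model under stated assumptions: the reference count table is a dictionary with an entry for every branch destination, branches are given as (source, destination) pairs, and `MAX_COUNT`, `split_candidate` and the example addresses are hypothetical.

```python
MAX_COUNT = 255  # hypothetical saturated value marking a definite function start


def split_candidate(start, end, ref_counts, branches):
    """Return the (start, end) functions carved out of one candidate.

    ref_counts maps address -> number of references to that address and
    must contain every branch destination; branches lists
    (source, destination) pairs for the whole image.
    """
    counts = dict(ref_counts)
    for source, destination in branches:
        internal = start <= source < end and start < destination < end
        if internal and counts[destination] != MAX_COUNT:
            counts[destination] -= 1   # reference came from inside the candidate
    # Any address still referenced is entered from outside the candidate.
    split_points = sorted(a for a, c in counts.items()
                          if start < a < end and c != 0)
    bounds = [start] + split_points + [end]
    return list(zip(bounds[:-1], bounds[1:]))


ref_counts = {0x100: MAX_COUNT, 0x110: 1, 0x120: 2, 0x140: MAX_COUNT}
branches = [(0x104, 0x110), (0x108, 0x120), (0x200, 0x120)]
functions = split_candidate(0x100, 0x140, ref_counts, branches)
```

In this example the branch from 0x200 enters the candidate at 0x120 from outside, so the candidate is split there; without that branch it would be kept as a single function.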

6.5 Translation Caching

The time to translate the whole of a sizeable executable image would be prohibitive. Thus
a function level code caching mechanism is used from one invocation of MetaMapper to
the next. This dramatically reduces the translation time for the majority of
edit-compile-debug cycles where only a small proportion of the executable image changes.
If the architecture of the target CriticalBlue processor changes then all code is invalid
and the entire cache is cleared.

Before a new translation is made for a particular function a check is made to see if it is
present in the translation cache. If so, and the original function code has not changed,
then there is no requirement to translate it again. The associated translated form of the
function is also stored in the cache and can be simply copied to the executable image.

The cache entry is identified by the symbol associated with its entry address. In most
cases this is simply the function's name. Functions that do not have an associated entry
symbol cannot be cached. However, such functions are unusual. An identifier is used
because association with a particular address would not be practical, as functions move
around as other functions in the executable are modified.

The binary image of the original function, used for comparison purposes, must also be
invariant to these address changes. If a function includes a call to another function, as is
commonly the case, then the change of destination addresses causes the binary to
change. However, this should not cause the calling function to be re-translated as its
actual implementation has not changed. The cached version includes structured data that
points to all immediate fields holding addresses. These are ignored in the binary
comparison process. Instead the symbolic destination is compared. As long as the same
function name is being called then the function is considered to be unchanged.
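
A cache with this address-insensitive comparison might be sketched as below. The class and function names, the representation of address fields as a bit mask plus a symbol name, and the example instruction words are all hypothetical; the sketch only illustrates masking address immediates and comparing symbolic destinations instead.

```python
def cache_key_material(words, address_fields):
    """Normalise a function image so that link addresses do not matter.

    words: list of 32-bit instruction words.
    address_fields: {word index: (mask of the address immediate, symbol)}.
    """
    masked = []
    for index, word in enumerate(words):
        if index in address_fields:
            mask, _symbol = address_fields[index]
            word &= ~mask & 0xFFFFFFFF   # ignore the address immediate
        masked.append(word)
    symbols = tuple(sorted((i, s) for i, (_m, s) in address_fields.items()))
    return tuple(masked), symbols


class TranslationCache:
    def __init__(self):
        self._entries = {}   # entry symbol -> (key material, translated code)

    def store(self, symbol, words, address_fields, translated):
        key = cache_key_material(words, address_fields)
        self._entries[symbol] = (key, translated)

    def lookup(self, symbol, words, address_fields):
        entry = self._entries.get(symbol)
        if entry is not None and entry[0] == cache_key_material(words, address_fields):
            return entry[1]
        return None


# The middle word stands for a call whose low 24 bits hold the destination.
fields = {1: (0x00FFFFFF, "helper")}
cache = TranslationCache()
cache.store("main", [0xE92D4010, 0xEB000010, 0xE8BD8010], fields, "translated-main")

relinked = [0xE92D4010, 0xEB000024, 0xE8BD8010]   # callee moved, same symbol
hit = cache.lookup("main", relinked, fields)
```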
7 Instruction Translation

7.1 Register Representation 

A CriticalBlue processor has a central register file that is used to hold values that are
written to registers in the host code. There is largely a one-to-one correspondence
between these registers and those present in the host architecture.

For the main execution code, only those register values that are live at the conclusion of a
strand need to be stored into the corresponding register. A register is live if it might be
subsequently read in the program. Temporary uses of registers within a strand do not
have to be reproduced in the register file. Thus the amount of register file traffic is very
significantly reduced in comparison to the host architecture. This allows the register file to
be synthesized alongside other execution units. It does not require special status or a
large multiplicity of access ports. If necessary the logical registers may be split over a
number of separate register file units to increase the amount of access bandwidth.
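
The effect of storing only live registers at the end of a strand can be illustrated with a short sketch. The liveness set is assumed to be supplied by a separate analysis; the function name `registers_to_write_back` and the register names and values are hypothetical.

```python
def registers_to_write_back(strand_writes, live_out):
    """Return {register: value} that must reach the register file.

    strand_writes: ordered (register, value) pairs produced inside a strand.
    live_out: set of registers that might be read after the strand.
    """
    final = {}
    for register, value in strand_writes:
        final[register] = value   # a later write supersedes an earlier one
    return {r: v for r, v in final.items() if r in live_out}


# Four host register writes inside the strand, but only r0 and r2 are live.
writes = [("r0", 1), ("r1", 5), ("r0", 7), ("r2", 9)]
stored = registers_to_write_back(writes, {"r0", "r2"})
```

Here the host code performs four register writes while only two values need to reach the central register file.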

7.1.1 Main Registers 

The main registers are directly equivalent to those present in the host architecture. The
majority of RISC host architectures have either 16 or 32 registers of 32 bit width. The
same number of registers is present in the CriticalBlue register file.

Some architectures provide a special meaning to register 0, which is guaranteed to read as
0 and form a data sink for writes. In the CriticalBlue architecture, register 0 has no special
meaning. Zero values are produced using an immediate unit and the effect of data write
sinking is produced by simply not using the result from an operation.

A register must be provided that is used to hold PC values pointing to the original
instructions. This register may be part of the main set of registers (as is the case with
ARM7). The PC register is effectively live after each instruction. However, the main
translated code does not keep this register up to date. Any read from the PC register is
handled as a special case during the translation process. The shadow code keeps the PC
register consistently up to date with the correct value from the host code after the
completion of each host instruction.

7.1.2 Condition Registers 

CriticalBlue has separate registers for each of the individual condition flags. In other
architectures these are usually combined as bit significant values into a single register.
CriticalBlue uses separate registers because not all flags are updated by particular
original instructions. CriticalBlue can emulate that behaviour by only updating the
appropriate condition registers if the flags are live on exit from a strand.

The flag values are available as single bit output results from the arithmetic and logic
units. If the condition code flag is not live then these output results can be read directly. If
the associated flag is live then the result must be read and written to the appropriate
condition code register. The connectivity between the result port and register file sign
extends the result to 32 bits, so if the flag is set then all bits are 1 and if reset then all bits
are 0.

Some condition code evaluations may require the reading of one or more flags and the
logical combination of them in order to determine the condition value. If a read or write
condition code register instruction needs to be translated from the host architecture then
The individual condition code registers are as follows:

□ Carry Register: Set if an operation resulted in a carry output.

□ Overflow Register: Set if an operation resulted in an arithmetic overflow.

□ Negative Register: Set if an operation resulted in a negative result (most
significant bit set).

□ Zero Register: Set if an operation resulted in a 0 result.
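
The separate, sign-extended condition registers can be modelled with the sketch below, which performs a 32-bit addition and writes only the flags that are marked live. The dictionary-based register model, the flag names used as keys and the function name `add_with_flags` are hypothetical choices for illustration.

```python
MASK32 = 0xFFFFFFFF


def add_with_flags(a, b, live_flags, condition_registers):
    """32-bit add that updates only the condition registers in live_flags."""
    total = a + b
    result = total & MASK32
    flags = {
        "carry": total > MASK32,
        # Signed overflow: operands share a sign that the result does not.
        "overflow": bool((~(a ^ b) & (a ^ result)) & 0x80000000),
        "negative": bool(result & 0x80000000),
        "zero": result == 0,
    }
    for name in live_flags:
        # A set flag is stored as all ones, a clear flag as all zeros.
        condition_registers[name] = MASK32 if flags[name] else 0
    return result


registers = {"carry": 0, "overflow": 0, "negative": 0, "zero": 0}
first = add_with_flags(0x7FFFFFFF, 1, {"overflow", "negative"}, registers)
after_first = dict(registers)
second = add_with_flags(0xFFFFFFFF, 1, {"carry", "zero"}, registers)
```

After the second call the overflow and negative registers still hold the values written by the first call, since those flags were not live for the second operation.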

7.1.3 Special Registers

^iste^" ulg'^^^^^ r ''^^^^^ 01 other 

registers. In thi iSlSrtecS^^?^ l^^^v^^^^^^ "^^^^ ^ implemented as banked 

code on enty to and S^fr^m fte ^om^^^ft^l^^^f^^' ^ ^^'^ ^ generate special 
into addition special r^steS ^P'^P"^*® ^^^^^^ to preserve the required registers 

^^^^ ,Tn ^LT^^':^^^^^^ -C.C. If .e host 
The following is a list of the special registers required:

Ha„.ie?co*.e„tr^(L^rrp^tus"?;:^'r.=SS^ir 
80 tife previous stack must be pre^J^eS! ^ 
° a S.^SSS.'^'*'^' Used to presage the state Of tl,e « register duHng 
° SS-coS^^-Su^LISa^^,-^^^ .he state Of the f^r 

° IS^'oTS^^rtorha'S^'^.t^'^'S?- P-^-e all volatile 
preserved usirtg'lh? no^noTog^ sS^^nl'^T^"'* =^ 
sliould also Include realsters «nr hrfrfrn„ , ^ ^ Preservation set 
include the stack re^nS Is^ JJ5? '~;«'™««°" """es. It shoukf also 
handier. Thb b 3£l ^ ™" I* «>r the 

creatk,n/dSon'vSine^p,S''SSL™^;iv'^^ «ng stac* «b„« 
registers IS required, as nested interruXS^pS,^^ ^ presenratlon 
7.2 ARM Instruction Translation

based Sn the mechanfms used tr^n^a£T^ iS.^'^'^f^^- "^^ ^^^^^Ptons are 

techniques are ,^eIyapp?c^L^o''R^^^^^^^ ''"^ "-^t of *.e 

7.2.1 Branches

until then end of the reaion SrTSn? ^ * ''"^ ^ *''^"ch ddes not occur 

occurs at the end .5 Se r^i^ fte^"ir.ffi'r?i;^ ^ ''"^'^ '^^"^^ 

Theie cannot be any opLl^ns in a st^nS ^ 1^ °^ strand, 

after issuing a b^ncli sSrclS<5; InsSonTd^LTb"^^^ ^^^^ --P'^*-* 

because they Tre no SyZchS If an^^ ^^is is 

rBstnctiononthelssulng^^er'jfr^resL^^^^^^^ *-P-- a 




[Figure: a strand including a branch operation, followed by a subsequent strand
including a branch.]



aTraTchTn T^'^mf oSon h""'^'- " ^'"^ ^ -^^-^"-t strand mat 
operat-on/Tliis is^uL^e^riK°^^^^^^ " 'T^^f ^^^^^h 
in its speculative phase ff Z SiZ bran^ T^l^L'^f "^l"^ ^^"^ still 

not enter te committed phase iStft^at^^e b^^^ 

imposed order so can aSually be llsu^S e^San Se'i^Jfb^nS"' " '^^^ ^" 

S Str^e'^dS^orSd^^^^^^ ^ an Immediate 

destination. A wide immedi^e un»?f«S mo^^ I *° executed at the 
branch Is being fesuX^Sj^ f ° f ""•^b^'" ^^at the 

being written and the eS addresf hfJ SLn L * -'^^ ^L"P '"""^'y ^ 

branch Itself. The ImmS value? ^'^^"^ ^P^'^^o" ^he 

to the branch contrS uS *° ^^"^^ ^^lue is then passed 
ITi^^ translated uses the same condition as that for the current strand th»„

?r^nJ"^^feirnsZ"u„"Sndr„*,r ^h"^-^' "^'^^^'^^ 
However, If an J^Mdr^lt^Sor^S^llSrjf^^lSJ^^ *l -™ ^>«nd. 



[Figure. Labels: Read Condition; Earlier Strand including Condition Evaluation;
To Fall Through Strand; Stub Branch Strand.]



7.2.2 Hardware Function Calls
Slame''"' "^^^^"^ ^ ^^"^'-^ - fixed locations of me callers

. UtJ paS? Se'Sa? from tcrSeT"^^^^ ^^^-P-^^ -^ister. 

parameters are then oa^^ ^ nn^n^J^ l ^ "^©se toaded 

maybemarkl^iSTeKrlcon'fig^^^^ P^^^^eters 
passed. Other parameterVrav rSr?L ^fcai^ed. These parameters are not 
pointers (or refeiWS^mete,^^ to hniH rf f parameters corresponding to 

passed as operands; '^^""s '^om the function. Again, these are not 

M^ed . OOP, .^.^.^.T«^rpK.\^tt- 

optimisations pe*rmed on ^C^^ PJ^t^^T°"y'r^'^- ^ 
parameter reglstere. This allows mm^kS^ 1,?'" "'<'.^"bs«I"ent read of 
routes them direouy to the opera^b^^^t Parameters and 

7.2.3 Software Function Calls

the^r^^S^r^j^^^^^^^^ ^Su^mTdTre^ t^^X^^^^^^ pHor to 

..nk register is implic^ysetfrom the n^pS^tue^t^^^^^^^ 

lep?Sernf^2a^reTi^^^^^^^ 'T^^^^ ^" 

is the return location and Doims to an SSlI J 2® Program. This 

value is Written to SiTnk ^L^^riSrlf a^Sra^^^^^^^ "^^^d'-^te 
of the destination address and fiS sh^nT fol£Sl?L^^^^ 's .mplemented as a load 
destination address is vvritt^ to *e Se imi^^^^^^ ^ operation. The actual 

address is resolved. ""^3® ^ ^ ^^up when the destination 

7.2.4 Software Interrupts

field that is rSad by the Sara ^^^^^1^^^^^^^^^ "'"^^'"^ ^" immediate 

Entry to a softvJe inte^f X S^?e^^^^^ ^ ^^ain type of operation. 

inter^pts are usually used to ttexlLS^^^^^^ 

be ve'S to me^'^^^^^^^^^ -taction causes execution to 

interrupt causes a Iftf^ZSe to me^'^la ^.^S^^^^^^^ ^ ~ 

necessary a symbol of that name can be inse^ed at^e starTo^^^^^^^^^ " 
the software intemjpt in an existina ob\e^m^ A^fJr o ^^^"^°"*"^P^^"^^'^9 
branch inst.c.-onthSvect:;rc:Son1?L^Lr^^^^^ ™ « 

l%"^^itinal^t^Z£' tot '-<^^ -to 

ve unK register. A branch to the software Inteniipt handler is then 
^ -«-re interrupt

main link register ^ ~P'®^ alternative Dnk register to the 

7.2.5 Data Processing Instructions

o'S^.rsr^^^Sn^S^^ Of i^e 

destination register. Some ODeratiorf<i^rh f?™ ^ * specified along with a 

a write-back to a register aS,!^^^^ compares and tests) do not actually cause 
operand allowing ^mSate re^Ser or ^^^^^ T ^^^"^^'^ hand 

mayoptionaUy AetX £nd£^^^^ *° instructions 

J^mt's^o^Kra,^^^^^ -P"-^ used to support 



OPERATION | DESCRIPTION | SUPPORTING UNIT | SUPPORTING METHOD
AND | Logical bit-wise AND of two 32 bit values. | Logical | AND
EOR | Logical exclusive-OR of two 32 bit values. | Logical | Exclusive-OR
SUB | Subtraction of two 32 bit values. | Arithmetic | Subtract
RSB | Subtraction of two 32 bit values with operands reversed. | Arithmetic | Subtract
ADD | Addition of two 32 bit values. | Adder if no flags set, else Arithmetic | Add
ADC | Addition with carry of two 32 bit values. | Arithmetic | Add with carry
SBC | Subtraction with carry of two 32 bit values. | Arithmetic | Subtract with carry
RSC | Subtraction with carry of two 32 bit values with operands reversed. | Arithmetic | Subtract with carry
TST | Logical bit-wise AND of two 32 bit values but without write-back of results - flags used only. | Logical | AND
TEQ | Logical exclusive-OR of two 32 bit values but without write-back of results - flags used only. | Logical | Exclusive-OR
CMP | Subtraction of two 32 bit values but without write-back of results - flags used only. | Arithmetic | Subtract
CMN | Addition of two 32 bit values but without write-back of results - flags used only. | Arithmetic | Add
ORR | Logical bit-wise OR of two 32 bit values. | Logical | OR
MOV | Move operation. Operand loaded and written to destination. | N/A | N/A
BIC | Bit clear operation implemented using NOT of operand and then bit-wise logical AND. | Logical | NOT + AND
MVN | Inverted move operation. | Logical | NOT
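The table can be read as a lookup from host mnemonic to supporting unit and method sequence. The following is a minimal sketch only; the names DATA_PROCESSING_MAP and select_unit, and the string labels for units and methods, are hypothetical and are not taken from the MetaMapper itself.

```python
# Hypothetical lookup mirroring the operation table: mnemonic -> (unit, methods).
DATA_PROCESSING_MAP = {
    "AND": ("Logical", ["AND"]),
    "EOR": ("Logical", ["Exclusive-OR"]),
    "SUB": ("Arithmetic", ["Subtract"]),
    "RSB": ("Arithmetic", ["Subtract"]),
    "ADD": ("Arithmetic", ["Add"]),
    "ADC": ("Arithmetic", ["Add with carry"]),
    "SBC": ("Arithmetic", ["Subtract with carry"]),
    "RSC": ("Arithmetic", ["Subtract with carry"]),
    "TST": ("Logical", ["AND"]),
    "TEQ": ("Logical", ["Exclusive-OR"]),
    "CMP": ("Arithmetic", ["Subtract"]),
    "CMN": ("Arithmetic", ["Add"]),
    "ORR": ("Logical", ["OR"]),
    "MOV": (None, []),                  # no functional unit needed for a plain move
    "BIC": ("Logical", ["NOT", "AND"]),  # two operations
    "MVN": ("Logical", ["NOT"]),
}


def select_unit(mnemonic, sets_flags):
    """Return (unit, methods) for a data processing instruction."""
    unit, methods = DATA_PROCESSING_MAP[mnemonic]
    # Per the table, ADD may use the plain adder when no flags are set.
    if mnemonic == "ADD" and not sets_flags:
        unit = "Adder"
    return unit, methods
```

For example, select_unit("BIC", False) yields the Logical unit with the two-step method sequence NOT then AND.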



The individual instructions are translated into a number of separate operations on the

u?S?«nH hV^^'^k"^. °f is dependent upon the addressing modi 

Sf? ^1 i t^""^^^. the fo"°win9 sub-sections. An operation is then SneratS^ to 
read the left hand register operand. This is followed by the translated data processing
operation itself. The majority of instructions map to a single data processing
fo ^tlT^ ^ ^^^^^ "^^P *° ^ opereHonsJf requirld then an^STgeSret^ 
S,dTtS?n'lle^'on^^r^ '"^^ upda£";'e'i"nd1^ 

Si^rT °Pe^f°"s are generated to update the affected condition code 

I!^n.n2Sf r*~?°" 5 ^"!'3ted into a number of individual operations. However 

iJ^nS^rl!*® >f^?° u *° °^ PC '^Sister (as the left hand operand) is 

handled specially. Such an operation is generally used to calculate the address of a data

SulSJ '"^^Pf^^"t -n^e full immediarva^e a4Tldlon fe 

calculated and then a single operation is generated to load it via an immediate unit. The

t°rSnS "r^^^ ^."^'"9 ^° ^<ldress can cha^ XtS Sc"ng a^ 

translation of the whole function containing the instruction.

The handling of addressing modes is described below: 

7.2.5.1 Immediate Addressing 

Ln^'^hS^^ ""S ^F^^^fJ"^ °f ^'^""^ ^P®^*^^ ^ an immediate value. This is divided into 
an 8 bit immediate field and a 4 bit immediate shift field. The shifted value is calculated
and an immediate unit of appropriate width selected to load the value. The narrowest
possible immediate unit is utilised in order to maximize code density.
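As an illustration of the calculation above, the sketch below decodes an immediate using the standard ARM encoding (the 8 bit field rotated right by twice the 4 bit field) and then picks the narrowest unit that can hold it. The function names and the set of available unit widths (8, 16 and 32 bits) are assumptions made for the example; the real set of immediate units depends on the processor configuration.

```python
def decode_arm_immediate(imm8, rot4):
    """Rotate the 8 bit field right by twice the 4 bit shift field."""
    amount = (2 * rot4) % 32
    value = imm8 & 0xFF
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def narrowest_unit(value, available_widths=(8, 16, 32)):
    """Return the narrowest (hypothetical) unit width holding value, or None."""
    for width in sorted(available_widths):
        if value < (1 << width):
            return width
    return None  # no single unit is wide enough; a fallback is needed
```

When narrowest_unit returns None the value cannot be loaded directly, which corresponds to the case where a full width immediate unit is not available.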

iJT^nh?^""^ rapable Of toading lull 32 bit values may not necessarily be available in 
r?n^ ?^?"i!;.^.f «ne immediate value generated after the shift may be wIdS- San 

?7oS.H^''^ '"^J." ^ ^*^P P^°^s basic 8 bitZr^S 

is loaded via an immediate unit. The availability of an immediate unit of at least 8 bits
is guaranteed. The shift units are then used to shift the value
appropriately to generate the required immediate value.

7.2.5.2 Register Shift by Immediate Value 

m^^IltSftHH "^1"® ^" immediate amount. If the shift amount Is zero 

« StS^ - ar. « s^eTal^ SS^S^^ 

7.2.5.3 Register Shift by Register Value 

^nJ^s^ t'*'^ " '^'^^ mw^l value and sK 

The shift units have special methods allowing them to extract the required shift amount

7.2.6 Multiply Instructions

7.2.7 Memory Access Instructions

^io^a^^.Tr Pra and post hdaxing of addressing nSdes 

ESrHlf r ^^^^ 
If a full 32 bit memory access is being performed then the access itself only requires a
single operation on the memory unit. Subword accesses are supported but require
additional operations as the shifting is performed externally to the memory unit.

In the case of a load the shifting is performed after the access. The byte shifter is supplied 
with the operands of the loaded data and the loaded address. The byte shifter uses the 
lower 2 bits of the loaded address (which are ignored by the memory unit) to shift the data 
item appropriately. Different shifter methods are used for byte and half word data and for
signed/unsigned variants. The data is moved to the least significant bytes of the result. 
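The load-side shifting can be sketched as follows. This is an illustrative model only: the function name shift_loaded is hypothetical and little-endian byte lanes are assumed, which the text does not state.

```python
def shift_loaded(word, address, size, signed):
    """Move a byte (size=1) or half word (size=2) to the low bytes of the result.

    word is the full 32 bit value returned by the memory unit; the lower two
    bits of address select the byte lane (little-endian lanes assumed).
    """
    shift = (address & 0x3) * 8
    bits = size * 8
    value = (word >> shift) & ((1 << bits) - 1)
    if signed and value & (1 << (bits - 1)):
        value |= 0xFFFFFFFF & ~((1 << bits) - 1)  # sign extend to 32 bits
    return value
```

The size and signed arguments stand in for the different shifter methods mentioned above.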

In the case of a store the shifting is performed before the access. The byte shifter is
supplied with the operands of the data to be stored and the store address. The byte
shifter uses the lower two bits of the store address (which are ignored by the memory unit)
to shift the data to the correct position in the 32 bit word for writing. Bytes that are not
being written will be undefined. Different shifter methods are used for byte and half word
writes. The data size is also specified to the memory unit. Only the required bytes from
the 32 bit word are written to memory. By explicitly shifting the data before the store, the
memory unit does not need to have any internal multiplexers to direct data to the correct
byte positions.
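A corresponding sketch for the store side is given below. The name shift_for_store and the byte-enable mask are illustrative assumptions, as is the little-endian lane ordering; the unwritten bytes are simply left as zero here, whereas the text treats them as undefined.

```python
def shift_for_store(data, address, size):
    """Return (word, byte_enables) for a byte (1), half word (2) or word (4) store.

    word holds the data shifted to its lane in the 32 bit word; byte_enables
    is a 4 bit mask of the bytes that the memory unit should actually write.
    """
    lane = address & 0x3
    mask = (1 << (size * 8)) - 1
    word = ((data & mask) << (lane * 8)) & 0xFFFFFFFF
    byte_enables = (((1 << size) - 1) << lane) & 0xF
    return word, byte_enables
```

Because the data arrives already positioned, the memory unit in this model only needs the size (here expressed as byte enables) to decide which bytes to write.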

Once a memory access has been added to the CDFG, checks are made with previous
memory accesses. Dependencies are added as required to ensure that the access is not
scheduled earlier than aliased stores without appropriate compensation mechanisms
being in place.

The following section describes the mechanisms used to generate code for an
addressing mode. The number of addressing modes and their complexity is dependent
upon the host architecture chosen. In general, RISC architectures have a rather limited
range of addressing modes. This description is largely based on those available for the
ARM7 architecture, which supports more modes than most comparable processor
architectures.

7.2.7.1 Zero Immediate Offset

If the offset is 0 then code is simply generated to read the value of the base register.

7.2.7.2 Non-Zero Immediate Offset

In this case translated code is generated to load the offset using an immediate unit, load
the value of the base register and then add the two values using an addition unit.

7.2.7.3 PC Base Register with Immediate Offset

Addressing using the PC register as a base is commonly used to access data sections via
a position independent means. The value of PC at the point of execution is determined
and the offset added to that value. This gives the address that will be generated by the
addressing. This is loaded using an immediate unit wide enough to handle addresses.

7.2.7.4 Register Offsets 

In this case we have two register offsets without any scaling applied to the right hand 
operand. We generate code to load the values of the registers on the left and right hand 
side. An addition operation is then generated to add the two values together.

In the case of ARM7 it is also possible to use a subtractive form of addressing where the
right hand side is subtracted from the left hand side base address. In this case a
subtraction rather than addition operation is generated.
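The immediate and register offset cases above can be summarised in a small sketch that emits an operation list for an address calculation. The function address_ops, the tuple layout and the unit labels ("regfile", "immediate", "arith") are hypothetical names chosen for illustration, not the translator's own data structures.

```python
def address_ops(base, imm=None, index=None, subtract=False):
    """Return a list of (unit, method, operands) tuples computing an address."""
    ops = [("regfile", "read", base)]
    if index is not None:
        # Register offset: read the right hand register, then add or subtract.
        ops.append(("regfile", "read", index))
        ops.append(("arith", "sub" if subtract else "add", (base, index)))
    elif imm:
        # Non-zero immediate offset: load the offset first, then add.
        ops.insert(0, ("immediate", "load", imm))
        ops.append(("arith", "add", (base, imm)))
    # Zero (or absent) immediate offset: the base register read is sufficient.
    return ops
```

A zero offset therefore costs a single register read, while the other forms cost three operations in this model.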
7.2.7.5 Scaled Offsets

In this case a register is used as the right hand side offset but is being scaled by an
arbitrary shift value. In some architectures the range of shifts available is limited. The right
hand register is read and the appropriate shifts are applied. The CriticalBlue architecture
has separate byte and bit shift units. One or both of these units is used depending upon
the shift value that needs to be applied. In some cases the carry output from the shift may
have to be preserved.
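One plausible way to divide a shift between the two units is sketched below. It assumes, without confirmation from the text, that the byte shift unit handles whole multiples of 8 bits and the bit shift unit handles the remaining 0 to 7 bits; the function names are hypothetical.

```python
def split_shift(amount):
    """Split a shift amount into (byte_shift, bit_shift) parts."""
    return amount // 8, amount % 8


def scaled_offset_ops(amount):
    """List the shift operations needed; either unit is skipped when unused."""
    byte_part, bit_part = split_shift(amount)
    ops = []
    if byte_part:
        ops.append(("byte_shift", byte_part))
    if bit_part:
        ops.append(("bit_shift", bit_part))
    return ops
```

Under this assumption a shift by 16 needs only the byte shift unit, a shift by 3 only the bit shift unit, and a shift by 19 needs both.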

7.2.8 Block Memory Instructions

The block memory instructions allow multiple words to be loaded or stored to memory
with a single host instruction. The behaviour of this instruction in the ARM architecture is
unusual in that it does not conform to the general principles of RISC instruction
implementation. It takes a variable number of clock cycles to execute depending upon the
number of registers that need to be stored or loaded. The multiple word access
instructions are commonly used in function prologues and epilogues to save and restore
volatile registers on the stack frame. The ability to do this with a single instruction
significantly improves the code density for the ARM architecture.

These block memory instructions are translated into multiple operations in the CriticalBlue
architecture. The base register is read and then for each individual access (as determined
by the register list in the host instruction) a memory operation is generated. An individual
addition to the base address, using an immediate offset, is generated for each access.
Individual offsets are generated rather than continually incrementing/decrementing a
single address value. This allows the order of the loads/stores to be changed by the
scheduler. This improved scheduling freedom allows the memory accesses to be more
easily issued in parallel with other operations in a function.

Each individual memory operation is analysed for potential aliasing with preceding store 
operations. Dependencies are added to the CDFG as required to ensure that the correct 
operation ordering is maintained. 

If the write-back bit is set in the instruction then an updated base register value is written.
This is achieved by adding an immediate offset to the original base register value and
then writing the result back to the same register.
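The expansion described in this section can be sketched as below for the increment-after form of the instruction only. The function name, the tuple encoding of operations and the restriction to one addressing variant are assumptions made to keep the example short.

```python
def expand_block_transfer(base, reg_list, is_load, write_back):
    """Expand an increment-after block transfer into separate operations.

    base and reg_list hold register numbers. Each access gets its own
    immediate offset from the original base, so the accesses carry no
    dependency on one another through a running address.
    """
    ops = [("read", base)]
    for i, reg in enumerate(sorted(reg_list)):  # lowest register, lowest address
        offset = 4 * i
        ops.append(("add_imm", base, offset))
        ops.append(("load" if is_load else "store", reg, offset))
    if write_back:
        ops.append(("add_imm", base, 4 * len(reg_list)))
        ops.append(("write", base))
    return ops
```

Since every address is derived from the same base value, a scheduler is free to reorder the individual loads or stores, subject to the aliasing dependencies.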

7.2.9 Memory Swap Instructions 

Swap instructions combine the load and store of a byte or 32 bit word in a single atomic
operation. This is used to implement semaphores where an atomic memory operation
is required. CriticalBlue does not support atomic swap operations. However, other than
atomicity the semantics of the swap instruction are translated. It results in both a load and
store operation. Appropriate dependencies are generated to any potentially aliased
preceding store operations.

7.2.10 CPSR Access Instructions 

The Current Program Status Register (CPSR) holds the state of condition code bits along
with various other bits concerning the status of an ARM processor. Only the condition 
code bits are supported in the translated form of the code. Other bits within the CPSR are 
ignored. 

The CriticalBlue architecture implements the individual condition code bits in separate 
registers to allow their independent read and update. This is done because certain 
instructions only update a subset of the condition code bits. It is only when explicit 
accesses to the CPSR are performed that the individual condition code bits need to be
formed into a single bit set in order to provide exact emulation of the architecture.
Fortunately such explicit CPSR accesses are rare.

The sequence of operations generated is dependent on the type of CPSR access being
performed.

7.2.10.1 CPSR Read 

A CPSR read obtains the state of the CPSR register and writes it to a general purpose
register. A CPSR read might be performed if the register needs to be preserved for any
reason. A CPSR read is converted into a sequence of individual operations in the
translated code. Firstly, a default CPSR value with all condition code bits reset is loaded
using an immediate unit. The other bits in the register are assigned to reflect the standard
user execution state. Each of the condition code registers for carry, negative, overflow
and zero are then copied into the appropriate bit position in the value. This is simply done
by masking off the appropriate bit from the condition code register (all bits in the registers
are set to the value of the flag) and then ORing the value into the CPSR state being
constructed. The final CPSR value can then be written to the appropriate general purpose
register.
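The mask-and-OR construction can be sketched as follows. The flag bit positions are those of the ARM CPSR (N, Z, C and V in bits 31 to 28); the default value 0x10 (user mode, flags clear) and the names used are assumptions for the example.

```python
FLAG_BITS = {"N": 31, "Z": 30, "C": 29, "V": 28}  # ARM CPSR flag positions
USER_MODE_DEFAULT = 0x10  # assumed default: user mode bits set, all flags clear


def build_cpsr(flag_regs):
    """Assemble a CPSR value from per-flag registers.

    flag_regs maps a flag name to a 32 bit register that is all ones when
    the flag is set and all zeros when it is clear.
    """
    cpsr = USER_MODE_DEFAULT
    for name, bit in FLAG_BITS.items():
        cpsr |= flag_regs[name] & (1 << bit)  # mask off one bit, then OR it in
    return cpsr
```

Because each flag register holds the flag in every bit, a single AND with a one-bit mask places the flag directly at its CPSR position with no shifting.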

7.2.10.2 CPSR Immediate Write

A CPSR immediate write sets the condition code bits (and other bits within the CPSR) to
specific values. Writes to the other bits are ignored but the writes to the condition code
bits are translated into appropriate writes to the individual condition code registers. The
immediate value is calculated. Operations are then generated to load a 1 bit immediate
value for each condition code bit and write it to the appropriate register. The single
immediate bit is sign extended to the full width of the register.
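A minimal sketch of the immediate write is given below. It assumes the ARM flag positions (bits 31 to 28) and 32 bit condition code registers; flag_writes is a hypothetical name.

```python
def flag_writes(immediate):
    """Return the full-width value written to each condition code register."""
    writes = {}
    for name, bit in {"N": 31, "Z": 30, "C": 29, "V": 28}.items():
        one_bit = (immediate >> bit) & 1
        # Sign extending a single bit gives all ones or all zeros.
        writes[name] = 0xFFFFFFFF if one_bit else 0
    return writes
```

All other bits of the immediate are ignored, matching the treatment of the non-flag CPSR bits.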

7.2.10.3 CPSR Register Write 

A CPSR register write sets the condition code bits (and other bits within the CPSR) to
values from a register operand. Writes to the other bits are ignored but the writes to the
condition code bits are translated into appropriate writes to the individual condition code
registers. Each of the individual condition code bits is masked off using an immediate
value loaded with an immediate unit. The zero result from the logical unit is then copied to
the appropriate condition code register to set its new state.

7.2.11 Unsupported Instructions 

Certain ARM instructions cannot be translated to CriticalBlue architecture equivalents. An
error is given if such an instruction is encountered during the translation process. An
option to the MetaMapper allows the error to be downgraded to a warning and no code
generated for the instruction (i.e. the instruction is ignored).

All instructions in the unused instruction space are unsupported along with all
coprocessor access instructions.



7.3 Special Code Insertion

In some cases additional code is inserted in the translation that was not present in the
original host code. This is to support the use of standard functions for use as handlers
without the need to resort to assembly language level coding. Additional instructions are
inserted to save and restore additional registers. In the case of code insertion at the main
entry point of the program, code is inserted to set up the interrupt handler vectors for the
architecture.

7.3.1 Main Entry Point

Code must be inserted to set up the interrupt vectors for the system. The inserted code is
positioned before the translation of the very first host instruction at the entry point. The
entry point itself may be denoted by two different means. If a symbol hw_main is defined
then the function starting at that address is considered to be the main entry point. Thus
the "main" function of the program can be declared as hw_main and any other entry code
normally executed before the entry point is ignored. In the absence of such a symbol the
first executable word in the code image is considered to be the entry point. This is normally
situated at address 0.
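The entry point rule can be expressed as a short sketch. The function name and the representation of the symbol table as a dictionary are assumptions; only the hw_main symbol and the default of address 0 come from the text.

```python
def find_entry_point(symbols, first_executable_address=0):
    """Return (address, via_hw_main) for the main entry point.

    symbols maps symbol names to addresses. hw_main takes priority;
    otherwise the first executable word of the image is used.
    """
    if "hw_main" in symbols:
        return symbols["hw_main"], True
    return first_executable_address, False
```

The second element of the result records which rule applied, since later steps of the entry code differ depending on whether hw_main was used.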

The first operation issued is a disable interrupt operation to prevent interrupts being
handled while the vectors are being changed. When the processor is first powered up
interrupts are disabled anyway so there is no possibility of an interrupt occurring prior to
the disable operation. A clear instruction buffer operation is also issued. This clears the
contents of the instruction buffer including any locked down entries. This is required so
that the lock down mechanism works correctly even if a branch is made to the entry point
rather than execution after power up when the instruction buffer is empty anyway.

If the processor has a cached instruction buffer then all fast interrupt handlers must then
be locked down into the buffer. This is achieved by calling each of the regions that are to
be locked down in a special mode. A list is created of all the regions contained in the
functions that are interrupt handlers. Each of these regions has a special strand that just
performs a return without executing any of the operations in the region. This is always the
last strand in the region. The branch to the region only executes the last strand. A special
calling method is used that allocates the region being called to the start of the instruction
buffer after any previously locked down entries. The calling region (containing the entry
code) may be overwritten but it can be reloaded on return from the called region.

Once any fast interrupt handlers have been locked down the handler addresses can be
set for both fast and slow interrupts. Each of the handler addresses is loaded using an
immediate unit. The addresses are derived from the entry addresses of the translated
handler functions. Each handler function has a special symbol name. Special branch
methods are then used to set the vectors into the correct registers within the branch
control unit.

Once all handler entry addresses have been loaded the interrupts are enabled. If the
entry point was via the hw_main symbol then an operation is issued to load the initial
value of the stack pointer. The address is specified as a command line option to the
MetaMapper. The translated code for the first actual host instruction at the entry point is
then executed.

7.3.2 Hardware Interrupt Handlers

Additional code is inserted at the start of the prologue to save the volatile registers into
specially allocated registers in the register file. These registers are not normally saved by
the function. However, for an interrupt function they need to be preserved as the interrupt
can occur at any point during the execution of a function. The stack pointer register and
the condition code registers are also preserved. The stack pointer is loaded with a new
immediate value that is specified as a command line parameter to MetaMapper. The
current stack cannot be used as the code may be manipulating the stack position at the
point of the interrupt and there may be valid data below the stack pointer that would be
overwritten. All the code inserted at the prologue is associated with the very first host
instruction from the function.

Code to restore the preserved registers is inserted at each return point from the function.
A single function may have a number of possible return points. The restore code is a
mirror image of the preservation code and is associated with the instruction that modifies
the PC register to cause the function return.

7.3.3 Software Interrupt Handler 

Additional code is inserted at the start of the prologue to preserve a subset of registers
into additional registers in the register file. This is to emulate the effect of entering a
software interrupt handler on the ARM architecture where some registers are
automatically preserved into another register file bank. Both the stack pointer and CPSR
registers are preserved. Operations are inserted to copy the stack pointer and the
individual condition code registers to special registers in the register file. The alternative
link register (which holds the return address after the software interrupt) is then copied to
the main link register. All the code inserted at the prologue is associated with the very first
host instruction from the function.

Code to restore the preserved registers is inserted at each return point from the function.
A single function may have a number of possible return points. The restore code is a
mirror image of the preservation code and is associated with the instruction that modifies
the PC register to cause the function return.

7.3.4 Monitor Handler

Additional code is inserted at the start of the prologue to save the volatile registers into
specially allocated registers in the register file. These registers are not normally saved by
the function. However, for a monitor function they need to be preserved as they may hold
live values at the stopped point of program execution. The stack pointer register is also
preserved. The stack pointer is loaded with a new immediate value that is specified as a
command line parameter to MetaMapper. The current stack cannot be used as the code
may be manipulating the stack position at the point of entry to the monitor and there may
be valid data below the stack pointer that would be overwritten. All the code inserted at the
prologue is associated with the very first host instruction from the function.

Code to restore the preserved registers is inserted at each return point from the function.
A single function may have a number of possible return points. The restore code is a
mirror image of the preservation code and is associated with the instruction that modifies
the PC register to cause the function return.

8 Debug Environment 

8.1 Overview 

Before an application is ever run on real hardware it will have been tested in the
CriticalBlue simulation environment. This allows full cycle and bit accurate testing.
Stimulus and behavioural modelling code will be produced to emulate the physical
environment that the application will be executed within. This process will allow the
discovery of most major bugs in the application. Since the simulation runs natively using a
C++ environment, the engineer is able to use his or her favourite debugger and integrated
development environment.

Of course, there are always likely to be application level bugs that only manifest
themselves in the real hardware environment. To allow easy analysis of these,
CriticalBlue supports a powerful debug environment. This is created from a combination
of hardware structures embedded into every CriticalBlue processor and a software
debugger capable of supporting source level debug.

8.2 Debug Chain

A standard debug port provides a connection between the CriticalBlue processor and the
host debugging platform. This will utilise a standard serial data format. The debug port
connects to a chain of serial connections between all the control and functional units
within the processor. A serial protocol is used since high data speeds are not required
and there is a need to minimise the area that the debug hardware occupies.

The debug chain allows individual registers in the processor to be read and modified
under the control of the debugger. This allows the internal state of the processor to be
interrogated and modified during the execution of a program. The overall processor clock
is halted while these operations take place.

The debug chain allows access to any of the execution control registers in the control
units. This includes the current program counter position and the various strand and
branch control registers. The debug chain also provides access to the register holding the
current instruction word being executed. The debugger can modify this register in order to
inject particular instructions. This is the primary mechanism that allows the debugger to
perform operations that change the state of the machine. For instance the user might type
in a command in the debugger that modifies the state of a memory location. The debugger
converts it into a store instruction. It is injected into the instruction word and then executed
by the processor.

Crucially, the debug chain also connects every single output register in all the functional 
units. This allows any item of data being held in the network to be read and modified. The
debugger is able to gather the parameter values being passed to a particular functional 
unit for an operation. This allows it to show the actual parameters used for an operation. 
Registers may be modified in order to present particular parameter values to an operation 
injected into the processor pipeline. 

The debug chain provides a powerful mechanism for debugging an application running on
a CriticalBlue processor. Moreover, it does not require a significant area overhead and
does not impact the timing of critical data and control flow paths.

8.3 Breakpoint Registers

A CriticalBlue processor also contains a number of breakpoint registers. These cause the
processor to halt if the breakpoint is reached. They are utilised if breakpoints are set in the
debugger. A breakpoint register consists of a region start address and a strand number.
Execution is halted if the particular strand in the region is executed. This allows
breakpoints that halt the machine on the equivalent of a particular source line in the code.
All later strands are squashed so execution is effectively stopped at the source line, even
if instructions from later source lines have been issued.
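The breakpoint comparison can be modelled as a match on the (region start address, strand number) pair. The class and function names below are hypothetical, and the sketch models only the match itself, not the squashing of later strands.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakpointRegister:
    region_start: int  # start address of the region
    strand: int        # strand number within that region


def should_halt(breakpoints, region_start, strand):
    """True if executing this strand of this region hits any breakpoint."""
    return BreakpointRegister(region_start, strand) in breakpoints
```

A debugger would translate a source line into such a pair and load it into one of the available breakpoint registers.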



8.4 Instruction Predicates

Instruction predicates are generated as part of the CDFG. They are used to delimit the
translated code for a particular host instruction. A single host instruction is translated into
a number of individual operations in the CDFG.

Instruction predicates are optimised out of the CDFG for the main code. However, they
remain in the shadow code. They provide guards for the execution of operations related to
a particular host instruction. This allows the single stepping of original host instructions
when executing the shadow code.

The diagram below shows a segment of a CDFG with instruction predicates. This is
typical of shadow code.



[Figure: CDFG segment showing, in order, the operations for host instruction N-1, the suffix for host instruction N-1, the prefix for host instruction N, the operations for host instruction N, the suffix for host instruction N and the prefix for host instruction N+1. Annotations: control arcs ensure instruction N-1 operations are executed before the predicate guarding instruction N; the instruction predicate guards execution of host instruction N; control arcs ensure all operations from host instruction N are executed after its guard. Legend: control arc; data arc; ordinary operation; operation always executed independent of predicate status.]



Each host instruction has a suffix consisting of two operations. Firstly, there is an
immediate load of the original host instruction address of the following instruction.
Immediate loads are always executed independently of the strand enable status.
Secondly, the instruction address is written to the PC register in the register file. This
emulates the behaviour of an instruction that sets the PC value of the next instruction to
be executed. If single stepping of host instructions is being performed then this sets the
correct address for the next instruction to be executed. The PC register write is executed
in the strand of the instruction it is related to so is only executed if the operations for the
instruction itself were executed.

Each host instruction also has a prefix consisting of an instruction predicate operation. 
The PC value loaded as part of the preceding suffix is provided as an operand to the 
instruction predicate unit. It compares the supplied address against breakpoint values
held in internal registers. This determines whether the SEM should be enabled to allow
execution of the following operations associated with the next instruction. Instruction
predicate operations are always executed independently of the strand enable status. If the
instruction is the first in the strand then there is no preceding suffix to obtain the instruction
address from. In that case the immediate load is included as part of the first suffix.
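The prefix and suffix scheme can be illustrated by emitting the operation sequence for a single strand. This is a sketch under stated assumptions: the operation names are hypothetical, host instructions are taken to be 4 bytes long so that the last instruction's successor is its address plus 4, and the extra immediate load for the first instruction is simply placed ahead of its predicate.

```python
def shadow_sequence(instructions):
    """Emit shadow code operations for one strand.

    instructions is a list of (address, ops) pairs in program order,
    where ops is the list of translated operations for that instruction.
    """
    out = []
    for i, (address, ops) in enumerate(instructions):
        if i == 0:
            # First in the strand: no preceding suffix supplies the address.
            out.append(("imm_load", address))
        out.append(("predicate", address))  # prefix guarding this instruction
        out.extend(ops)
        if i + 1 < len(instructions):
            next_address = instructions[i + 1][0]
        else:
            next_address = address + 4      # assumed sequential successor
        out.append(("imm_load", next_address))  # suffix, part 1
        out.append(("write_pc", next_address))  # suffix, part 2
    return out
```

Each predicate after the first consumes the address loaded by the preceding suffix, which is why only the first instruction needs its own load.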

As shown in the diagram, control arcs are added to serialise the operations for different
host instructions in the CDFG. This ensures that any strand enable/disable caused by the
instruction predicate unit only affects the relevant host instruction operations. All
operations related to a particular host instruction are dependees of the prefix predicate for
that instruction. This ensures the operations cannot migrate earlier than the predicate.
The control arcs are labelled with the latency corresponding to the number of clock cycles
required to update the SEM value after the execution of an instruction predicate. Also, the
predicate for the following instruction is a dependee of all operations from the preceding
instruction. This ensures that operations cannot migrate down into the operations
associated with the following instruction. These control arcs occur within and between
strands. The arcs do cause a significant amount of serialisation of operations with a
resultant adverse effect on the code density that can be achieved. However, these
dependencies are only present for the shadow code. The shadow code is only required
when debugging needs to be supported.
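The serialising control arcs can be sketched as a list of (source, destination, latency) edges. The function name, the edge representation, the default SEM update latency of one cycle and the zero latency on arcs into a predicate are all assumptions made for the example.

```python
def control_arcs(groups, sem_latency=1):
    """Build control arcs that serialise host instructions in shadow code.

    groups is a list of (predicate, ops) pairs, one per host instruction,
    in program order. Returns (source, destination, latency) tuples.
    """
    arcs = []
    previous_ops = []
    for predicate, ops in groups:
        # Operations of the preceding instruction must precede this predicate.
        arcs += [(op, predicate, 0) for op in previous_ops]
        # This instruction's operations must wait for the SEM update.
        arcs += [(predicate, op, sem_latency) for op in ops]
        previous_ops = ops
    return arcs
```

With these arcs in place a scheduler cannot hoist an operation above its own predicate or sink it below the next one, which is the source of the serialisation cost noted above.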

In the case of a conditional branch operation, the operations associated with the
instruction may load one of two successor PC values depending on whether the branch is
taken or not. This is illustrated in the diagram below. Note that the diagram does not show
the control dependencies between operations required to serialise their execution.




[Figure: shadow code for a conditional branch across three strands. Condition Evaluation Strand: instruction predicate prefix for the branch, condition evaluation and squash operations, and the immediate load and register write of PC for the following instruction (fall through case). Branch Strand: immediate load and register write of PC for the destination instruction (branch case) and the branch operation. Following Strand: immediate PC load and predicate for the following instruction. Legend: ordinary operation; unconditional operations.]



The first strand holds the instruction predicate for the branch along with the operations to
evaluate the branch condition. The operations also include squashes to control the
execution of a subsequent strand in which the branch is held. The first strand also
contains operations to load the PC value of the subsequent instruction after the branch.
These operations are always executed but may be subsequently overwritten by a branch
destination PC value if the branch is actually taken.

The following strand containing the branch loads PC with the destination for the branch. 
The strand also contains the branch itself. The operations within the strand are only 
executed if the branch is being taken. 

Finally, the following strand contains the operations associated with the instruction after
the branch. This strand is only executed if the branch is not taken. The host instruction
address is loaded and passed to the instruction predicate unit to guard execution of the
operations. The PC value is explicitly loaded since the instruction prefix is at the start of a
new strand.



8.5 Shadow Code

Shadow code is generated from the original source instructions. It represents a less
efficient translation than the main code but provides support for instruction level debug. It
allows the machine state that would be generated by the original binary code to be
reproduced on a CriticalBlue processor.

When generating shadow code a number of additional operations are added into the
CDFG that are not present for the main code CDFG. These additional operations are
used for implementing debug capability at a source instruction level granularity. The
operations are as follows:

□ PC Register Updates: At the transition of source instructions code is generated
to load the original PC address of the source instruction involved and store it in the
PC register in the register file. In the main code no attempt is made to keep the
PC register architecturally correct with respect to the original code as its value is
implicit.

□ Instruction Predicate Operations: These operations form a prefix to all source
instructions and are used to allow instruction granularity execution control in the
shadow code. They also serialise the operation groups associated with different
host instructions so that they do not become intermixed.

In the shadow code all register file read and write operations are added into the CDFG.
These are not optimised in any way so that the full register file state is maintained at
source instruction boundaries. The shadow code is thus considerably larger than the main
code. It is also much slower as many operations are serialised around the instruction
predicate operations. However, this is not a major issue as the shadow code only needs
to be executed when a breakpoint is activated and the code itself can be stored externally
to the code memory of a CriticalBlue processor.



8.6 Debugging Translated Code 

Before an application is ever run on real hardware it can be tested in the CriticalBlue
simulation environment. This allows full cycle and bit accurate testing. Stimulus and
behavioural modelling code will be produced to emulate the physical environment that the
application will be executed within. This process will allow the detection of most major
bugs in the application. The simulation runs natively using a C/C++ environment. The
engineer is able to use his or her favourite debugger and integrated development
environment.

Of course, there are always likely to be application level bugs that only manifest
themselves in the real hardware environment. To allow easy analysis of these,
CriticalBlue supports a powerful debug environment. This is created from a combination
of hardware structures embedded into every CriticalBlue processor, a monitor program
present on the processor and special code produced by the CriticalBlue tools.

A CriticalBlue processor is able to provide a sufficiently accurate emulation of a third-party
processor architecture that debug tools for that architecture can be used. The first release
of the CriticalBlue tools will support the ARM architecture. ARM compatible symbolic
debug tools running on a host PC platform can be used. This will provide source level
debug capabilities.

The overall debug architecture is illustrated in the diagram below:

The host communicates with the target system via a serial or parallel link. A serial link
[...] occupies. A remote debugging protocol is run over the link.
The host debugger can send commands to the target system to set breakpoints
[...]. When commands are [...]
[...] handlers in the MetaMonitor code.
These process the commands and send appropriate responses. The first version of the
[...] protocol. Thus a CriticalBlue system will
be able to use any ARM debugger supporting the protocol.

A CriticalBlue processor also contains a number of breakpoint registers. These cause the
processor to halt if the breakpoint is reached. They are utilised if breakpoints are set in the
debugger. A breakpoint register consists of a region start address and a strand number.
Execution is halted if the particular strand in the region is executed. This allows
breakpoints that halt the machine on the equivalent of a particular original instruction in
the code. All later strands are squashed so execution is effectively stopped at the source
line, even if operations from later original instructions have been issued.
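
As a rough illustration of the breakpoint register match described above, the following Python sketch models a register as a (region start address, strand number) pair. The class and function names are hypothetical and the hardware comparison is reduced to a simple equality test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakpointRegister:
    region_start: int  # start address of the translated region
    strand: int        # strand number within that region


def should_halt(breakpoints, region_start, strand):
    """Return True if the strand about to execute matches any register."""
    return any(
        bp.region_start == region_start and bp.strand == strand
        for bp in breakpoints
    )
```

Under this model a breakpoint on strand 3 of the region at `0x8000` halts only when that exact strand of that region is reached; other strands of the same region run normally.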

Because of the parallelism in the architecture, code can be scheduled out-of-order.
[...] may be generated in a completely different order
to that expressed in the original code. The user should not need to be aware
of this. When they are debugging the code and single stepping through it they expect
[...] the original code [...].

9 Translation Related Algorithms

9.1 Function Handling 
9.1.1 Function List Creation 

FunctionList CreateFunctionList (UsedMap, FixupList)

// Examines all the code in the executable and generates a list of functions 
// (in terms of address ranges) by examining destinations of calls and 
// function pointer values. Ideally function entry points are associated 
// with symbols so that they can be located again in the code database. 
// However calls to addresses without a symbol are handled but the
// function has to be rescheduled on each occasion.

// UsedMap is returned as a bit map of all the locations in the executable
// image that are used and cannot be overwritten with translated code
// FixupList is a list of locations that need to be fixed up with the
// addresses of particular functions (for function return links and
// function pointers)

// we examine the executable image and create counts of entry point usage 
// to try and identify individual functions - we also mark which words in 
// the image cannot be modified by the translated code 
UsedMap = cleared 
FixupList = empty 

entry_list = address 0 (main entry) and address 8 (SWI handler) 

foreach word in the executable image do 

if word is function pointer data and not in filler code then 

// if we have a function pointer then we make sure the addressed
// function is an entry point - we add the entry word to the fixup list
// so that a link is written to the start of the function
mark word as used in UsedMap
add addressed function to entry_list with infinite usage count
add addressed function to FixupList (so that link is written to start
of function)

if addressed function is marked as being implemented in hardware or 

is an interrupt function then 

report an error 
endif 

else if word is data and not in filler code then 

// all data is used except if it is in filler code (used for 
// shadow code) where the whole image can be overwritten 
mark word as used in UsedMap 

else if instruction is a SWI then 

// a SWI instruction needs to be preserved (for the immediate value)
// and the following word is used to hold the link return address
mark SWI instruction as used in UsedMap 

mark following instruction in FixupList with address of return from 
the SWI handler 
else if instruction is indirect branch for switch code then 

// for switch using an indirect branch we identify all the possible 
// destinations - each is written with a link to the translated 
// implementation of the code 

determine the range of possible values (by analysing range check code) 
foreach switch_dest in the possible range do
mark switch_dest as used in UsedMap
mark switch_dest in FixupList so that link to translated code is
placed in the switch table (if the entry does an immediate direct
branch then place a shortcut to the destination function)
mark switch_dest with infinite usage count so that it is a separate
entry point (although may become side entry of a region)
endfor

else if instruction is a branch then 

// we increase the reference count of the destination to be able to
// handle functions that are only entered using a branch
add destination address to entry_list (or increase reference count
if already in list)
else if instruction is a branch and link (i.e. call) then

// if we have a call then we ensure that the function becomes a distinct
// entry point and arrange for a link return address to be written
add destination address to entry_list with infinite count
mark following instruction in FixupList with address of return from
the function

if destination function is an interrupt function then

// interrupt functions cannot be called directly as they have
// additional code for preserving registers
report an error

endif

endif
endfor 

// we now go through the potential function entry points
func_list = empty

foreach cand_func in entry_list do

Initially make the function all code between two forced entry points (i.e.
addressed functions or destinations of calls)
Revisit all the code within the candidate function and reduce reference
counts for branch destinations
if there are non-zero reference counts in the function then 

Subdivide the function between each of the referenced addresses - this 
detects any other side entries that are branched to 

endif 

mark function as critical if entry symbol of function in critical list 
if function not on list of those implemented in hardware and 

function is not a code filler then 
// do not add behavioural model code for functions implemented in
// hardware or filler code (to make executable big enough)
CFLT = CreateCFLT (start address of function, end address of function,
func_list (only backward information available))
add function to func_list along with live inflow information from
CFLT - this shows which register parameters are used by the function
(non-volatile registers are masked off from the liveness even though
they may appear live due to the save and restore code)

endif 
endfor 

return func_list
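
The reference counting used to discover entry points can be sketched as follows. This is a simplified model under stated assumptions: instructions are pre-classified into hypothetical kinds (`"branch"`, `"call"`, `"funcptr"`, `"other"`), the two initial entries (addresses 0 and 8) are treated as forced, and `math.inf` stands in for the "infinite usage count".

```python
import math


def collect_entry_points(instrs):
    """Count references to candidate entry points.

    instrs: iterable of (kind, dest) pairs; kind is one of
    "branch", "call", "funcptr" or "other" (hypothetical labels).
    """
    # Main entry and SWI handler; treating them as forced is an assumption.
    entries = {0: math.inf, 8: math.inf}
    for kind, dest in instrs:
        if kind in ("call", "funcptr"):
            # Calls and function pointers force a distinct entry point.
            entries[dest] = math.inf
        elif kind == "branch":
            # Plain branches only raise the reference count.
            entries[dest] = entries.get(dest, 0) + 1
    return entries


def forced_entries(entries):
    """Addresses that always start a new function, in ascending order."""
    return sorted(a for a, n in entries.items() if n == math.inf)
```

Candidate functions are then the address ranges between consecutive forced entries; any branch destination whose count is still non-zero after the internal branches have been discounted marks a side entry.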



9.2 Region Handling
9.2.1 Region List Creation

RegionList CreateRegionList (StartAddr, EndAddr, FuncType, FuncList) 

// Identifies the regions within a particular function (as specified
// by an address range). Each region also has a loop depth that is used to
// weight requirements for new resources during architectural synthesis.
// FuncType is the type of function being scheduled
// FuncList is the full list of functions in the code - this is used
// to generate better liveness information for the function

// get the control flow and liveness information for the function 
CFLT = CreateCFLT (StartAddr, EndAddr, FuncList)

// determine the maximum number of original instructions in a block 
if FuncType = Critical then 

// if we are creating a critical function then we do not limit the
// number of instructions in a block
MaxBlockInstrs = LARGE_VALUE
else
MaxBlockInstrs = maximum instructions in a block - calculated as
a proportion of total function size (perhaps 25%) to
break up basic blocks for parallel scheduling in short functions
endif

// process the instructions in the function in ascending address order and
// create individual regions as required

region_list = empty 

cur_addr = StartAddr 

cur_depth = 0 

do 

region_depth = cur_depth
new_region = CreateRegion (cur_addr, EndAddr, cur_depth, FuncType,
MaxBlockInstrs, CFLT)
add new_region to region_list with depth of region_depth
until cur_addr > EndAddr repeat

return region_list



9.2.2 Control Flow and Liveness Table Creation

CFLT CreateCFLT (StartAddr, EndAddr, FuncList)

// Generates the Control Flow and Liveness Table (CFLT) for the function
// with the given address range. The table contains information about
// control flow within the function including sources for back edges.
// Full liveness analysis is performed within the function by scanning
// instructions and determining which registers they use and define.
// If a function call can be found in the FuncList then the actual inflow
// parameters for the function can be used for the liveness - otherwise
// pessimistic assumptions must be made

// FuncList is the list of functions in the code along with their inflow 
// liveness information (which may be slightly pessimistic if the 
// function itself calls other functions) 

CFLT = create edge information table with entry for inflows and outflows for
each instruction in the range

// initially scan the instructions in the function to find the control flow
foreach cur_instr = StartAddr to EndAddr do
if cur_instr is branch then 

if destination is in range of function then 

// update the CFLT with the appropriate edges 
create outflow edge information at cur_instr in CFLT 
create inflow edge information at destination instruction in CFLT - 
mark if the edge is backwards 

endif 

if cur_instr branch is conditional then 

// a pseudo edge is added for fall through cases 
add inflow to following instruction 
endif 
endif 
endfor 

// pass through the identified basic blocks and calculate the use and 

// definition vectors of registers within the function 

foreach basic block in the CFLT do 

determine the use and definition masks for registers in the block - all 
calls and branches to named symbols in FuncList use the inflow 
liveness from the entry otherwise they are considered to use all 
possible parameter registers. All function calls are considered 
to define all volatile registers. 

endfor 

// now determine the in and out liveness of each basic block - this is
// done iteratively until a fixed point is reached
do
foreach basic block in the CFLT do
update the live in and live out of the block
endfor
until no changes during pass repeat
return CFLT
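
The fixed-point step at the end of CreateCFLT can be illustrated with a small Python sketch. The pseudocode does not spell out the update equations, so the standard backward dataflow equations are assumed here: live-out is the union of the successors' live-in sets, and live-in is use ∪ (live-out − def). The block names and data layout are hypothetical.

```python
def liveness(blocks, succs, use, defs):
    """Iterate live-in/live-out sets of registers to a fixed point.

    blocks: list of block names; succs: dict block -> successor blocks;
    use/defs: dict block -> set of registers used / defined in the block.
    """
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:  # "until no changes during pass"
        changed = False
        for b in blocks:
            out = set().union(*(live_in[s] for s in succs.get(b, ())))
            inn = use[b] | (out - defs[b])
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out
```

In the CFLT itself the use and definition masks would already account for calls, with callee inflow liveness taken from FuncList where available and pessimistic assumptions otherwise.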



9.2.3 Create Region

Region CreateRegion (StartAddr, EndAddr, CurDepth, FuncType,
MaxBlockInstrs, CFLT)
// Creates the CDFG for a region by translating original code between a 
// given start and end address. If necessary unrolls loops by duplicating 
// code to improve schedule potential. 

// CurDepth is the current loop depth at the start of the region and is 
// updated as required 

// FuncType is the type of function being scheduled - if the function is
// critical then it should be optimised for performance rather than code density
// MaxBlocklnstrs is the maximum number of original instructions in a 
// block - this may be limited to break up blocks in short functions 
// to improve code density 

// CFLT is the previously calculated Control Flow and Liveness Table (CFLT) 

// reset variables for start of region 

CDFG = empty
reg_defs = empty
is_entry_strand = true
num_dups = 1
total_weight = 0
total_instrs = 0
cur_addr = StartAddr
end_type = EndRegion
last_addr = maximum value

// If we are generating the main execution code then we need to generate a 
// synchronisation branch to the start of the shadow code for the region. 
// This branch is taken if a breakpoint is encountered.
if not generating shadow code then
fixup_node = GenOp (CDFG, Wide Immediate Unit, shadow region start)
mark fixup_node as being a fixup node
(void) GenUnaryOp (CDFG, branch unit, synch branch, fixup_node)
endif

// build the region 
do 

// Process the inflow edges for the current instruction. Avoid processing
// an instruction for a second time (for instructions that cause a new
// strand to be started). We also only process inflow edges on the
// first unroll of a loop.

if num_dups = 1 and cur_addr != last_addr then
last_addr = cur_addr

if there are inflow edges for cur_addr in CFLT then 

// if we have reached an inflow edge that has been branched
// to from an earlier strand in the region then we can delete
// the branch associated with the strand and reduce the number
// of strands since the branch strand is gone
foreach visited inflow edge from CFLT do

if branch was from strand in the current region then 

// the branch is internal so delete the branch - the squash 
// condition is inverted and its new destination is marked 
// to squash all the strands after the branch and up to the 
// destination 

delete the branch node associated with the squashing strand
if the squash strand has another fixed outflow then
// If the branch was originally unconditional then the squash
// may be in a much earlier strand that has already been
// fixed - we need to replicate the squash instruction. This
// handles IF-THEN-ELSE code. The unconditional branch for the
// end of the THEN clause modifies the original squash
// instruction using an inverted condition to the ELSE
// clause.
replicate the squash instruction
endif

invert the squash condition 

mark the squash strand destination as the set of strands 
following the branching strand to the destination 

if the branch was the only operation in the strand then
// The branch was in a separate strand - it was conditional
// with a different condition to the previous strand. The
// strand is no longer needed so the total number of
// strands is reduced.
delete the phase barrier associated with the strand
delete the strand itself
endif
endif
endfor

// keep an ongoing count of the loop depth
CurDepth = CurDepth + number of back edges into cur_addr

endif
endif

end_type = EndRegion 

if AdvanceToCode (cur_addr, EndAddr) then
// find the end of the current block
end_block_addr = cur_addr
end_type = FindBlockEnd (CDFG, end_block_addr, MaxBlockInstrs, CFLT)

// find out what the multiplier will be for the new strand 
strand_multiplier = CalcNewStrandMultiplier (CDFG, cur_addr, CFLT) 

// If this is an entry strand then the total weight of all previous
// instructions in the region is halved. This lowers the overall
// density and thus reduces the chance that the entry strand will be
// accepted into the region. Multiple entry strands are inefficient
// for execution performance.
if is_entry_strand then
total_weight = total_weight / 2
endif

// increase the total instructions by the number in the block and
// the weight by the number in the block weighted by the strand
// multiplier
total_instrs = total_instrs + (end_block_addr - cur_addr)
total_weight = total_weight + strand_multiplier *
(end_block_addr - cur_addr)

// We calculate the execution density. This is the weighted number of 
// instructions divided by the total number of instructions. As more 
// conditional blocks are accepted into the region the execution 
// density is lowered. 

exec_density = total_weight / total_instrs 

// Get the minimum acceptable density. There is a different value for 
// critical functions where performance is emphasised over code 
// density. 

if FuncType = Critical then
min_density = CRITICAL_MIN_DENSITY
else
min_density = NORMAL_MIN_DENSITY
endif

// Translate the block if adding it will keep the execution density at
// an acceptable level. This mechanism prevents too much conditional
// code being added to a region that will increase its length too much.
// If the conditional code is squashed then this significantly lowers
// the effective performance of the machine. If we run out of strands
// when translating the block then end the region.
if exec_density >= min_density then
if not TranslateBlock (CDFG, cur_addr, end_block_addr, RegDefs,
is_entry_strand, strand_multiplier, CFLT) then
end_type = EndRegion
endif
endif

// If the block ended due to a call site then update the total
// instructions appropriately. This lowers the execution density
// when a call is encountered. Calls are inefficient if the region
// has to be returned to after the return. The inefficiency is related to
// the number of instructions so far (which have to be skipped over
// again on return to the same region). It is multiplied by the strand
// multiplier because if there is a smaller chance that the call will be
// executed then there is less of an impact.
if end_type = CallSite then
total_instrs = total_instrs + total_instrs * strand_multiplier
endif

// there is a limit on total instructions in the region 
if total_instrs > MAX_INSTRUCTIONS then 

end_type = EndRegion 
endif 
endif 

is_entry_strand = false 

if end_type = InflowEdge then 

// The block ended due to an inflow edge. If all the edges are visited 
// then they are from previous strands and thus internal. Unvisited 
// edges cause the region to be ended for critical functions (which 
// emphasize performance over code density). Other regions may
// continue but the next strand will be an entry strand and there
// are limits on operations and strands.

if there are any unvisited inflow edges for cur_addr in CFLT then
is_entry_strand = true
if FuncType = Critical or num_strands > MAX_CONTINUE_STRANDS or
num_ops > MAX_CONTINUE_OPS then
end_type = EndRegion
endif
endif

else if end_type = OutflowEdge then 

// The block ended due to an outflow edge. If there is no route 

// into the following code (i.e. the code is dead) then make sure 

// it is an entry strand. This also includes a return instruction. 

if outflow edge is unconditional and cur_addr has no inflow edges then
is_entry_strand = true
endif

// reduce the loop depth as we pass backward outflow edges
if the outflow edge is backwards then
CurDepth = CurDepth - 1
endif

// if we are scheduling a critical function and this looks like the
// end of a loop (the outflow is backwards) then consider performing
// code duplication to unroll the loop
if FuncType = Critical and outflow edge is backwards then
if edge dest is StartAddr and total_instrs < MAX_DUP_INSTRS and
num_dups < MAX_DUPS and num_strands < MAX_DUP_STRANDS then
// We duplicate code if we are branching back to the start of the
// current region. There is a limit on the number of duplications,
// the total number of instructions and the total strands thus far
// (to avoid only part of an iteration being unrolled). We need
// to modify the last branch in the loop and set the address
// back to the start of the region.
if outflow edge is unconditional then
delete the branch as we are unrolling code to follow it
else
mark cur_strand as invert condition and change its destination
to the current address (previous fall through)
endif
num_dups = num_dups + 1
cur_addr = StartAddr
else
// if we cannot unroll then we always end a region when a back
// edge is detected in a critical function to avoid loading
// redundant code into a loop
end_type = EndRegion
endif
endif

else if end_type = CallSite then 

// the next strand will be an entry strand to handle the return from 
// the call (this handles both direct and indirect calls) 
is_entry_strand = true

endif 

until end_type = EndRegion repeat 

// resolve the squashes by allocating strand numbers - any branches which 
// are left in the code are to locations external to the region and are 
// picked up as requiring fixups when generating the binary for the code 
ResolveSquashes (CDFG, CFLT)

// If this is a region in a fast interrupt function then we generate a 
// special return instruction in the final strand. This is used when 
// locking down the code for the function during the initialisation 
// sequence. 

if FuncType = FastInterrupt then
StartNewStrand (CDFG, true, FuncType)
read_op = GenRegRead (CDFG, RegDefs, link register)
GenBranch (CDFG, read_op)
endif

create new_region from CDFG 
return new_region 
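
The execution density bookkeeping inside CreateRegion is plain arithmetic and can be traced with a short Python sketch. The function name and the tuple layout are assumptions for illustration; the call-site adjustment and the actual density thresholds are left out, and every block is assumed to contain at least one instruction.

```python
def execution_densities(blocks):
    """Return the execution density after each candidate block.

    blocks: iterable of (num_instrs, strand_multiplier, is_entry_strand),
    with num_instrs > 0.
    """
    total_weight = 0.0
    total_instrs = 0
    densities = []
    for num_instrs, multiplier, is_entry in blocks:
        if is_entry:
            # Entry strands halve the weight accumulated so far.
            total_weight /= 2
        total_instrs += num_instrs
        total_weight += multiplier * num_instrs
        densities.append(total_weight / total_instrs)
    return densities
```

For an 8-instruction entry block followed by a 4-instruction THEN clause, a 4-instruction ELSE clause (both with multiplier 0.5) and a 4-instruction join block, the densities are 1.0, 10/12, 0.75 and 0.8. Each value would then be compared with the minimum density for the function type to decide whether the block is accepted into the region.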



9.2.4 Squash Dependency Update 

void UpdateSquashDeps (CDFG, CurAddr, CFLT)

// Updates the squash dependency matrix. This details which strands precede 
// the current one in the control flow. This is used for determining which 
// predecessors are relevant when generating the register flow and other 
// dependency information in the CDFG. 

// CurAddr is the current instruction address being processed whose inflow
// edges are examined

// CFLT is the Control Flow and Liveness Table and is used to provide the 
// required control flow edge information 

deps_set = empty 

foreach inflow edge to CurAddr from another strand in the CFLT do 
add the branching strand to the deps_set 

add the set of dependent strands for the branching strand itself to the 
deps_set 
endfor 

make the dependencies for the current strand equal to deps_set 
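
A minimal Python sketch of this update, assuming the dependency matrix is held as a dictionary from strand number to the set of preceding strands (a representation chosen only for this example):

```python
def update_squash_deps(deps, cur_strand, inflow_strands):
    """Record which strands precede cur_strand in the control flow.

    deps: dict strand -> set of strands it depends on (updated in place).
    inflow_strands: strands with an inflow edge to the current address.
    """
    deps_set = set()
    for branching_strand in inflow_strands:
        deps_set.add(branching_strand)
        # Inherit everything the branching strand itself depends on.
        deps_set |= deps.get(branching_strand, set())
    deps[cur_strand] = deps_set
    return deps_set
```

Because each strand inherits the sets of its predecessors, the stored set is transitively closed as long as strands are processed in control flow order.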
9.2.5 Squash Code Resolution

void ResolveSquashes (CDFG, CFLT)

// Examines all the squashing strands and provides the final numbering for
// them. Generates the correct operands for the template squash instruction
// and generates additional squash operations for each strand as required.
// CFLT is the Control Flow and Liveness Table and shows the control flow
// between strands

// number all the strands 
strand_num = 0 

foreach strand in the CDFG do 
allocate strand_num to strand 

if strand has not been resolved or has a conditional spur then
// the strand performs an external branch (with a spur branch) or
// is associated with a conditionally executed instruction with a
// spur
strand_num = strand_num + 1
endif

strand_num = strand_num + 1
endfor

foreach strand in the CDFG do 

// determine the set of strands that must be unconditionally squashed
force_squash_set = empty

if strand is an entry strand then
// All entry strands (including the first strand) must squash all
// strands that they cannot reach. Thus a strand containing
// unreachable dead code will be squashed. Also if a call is in the
// then part of an if-then-else then the entry strand after
// the call will squash the else part.
for later_strand = strand + 1 to last strand do
if later_strand cannot be reached from strand then
add later_strand to force_squash_set
endif
endfor
endif

// determine the set of strands that must be conditionally squashed
cond_squash_set = empty

if dest is resolved (branch has been removed) then
// if the strand is resolved (i.e. the associated branch instruction
// is not still remaining) then it is to an internal strand so we
// need to generate the correct squash conditions. If there are
// multiple enclosed condition blocks then each must be set with
// the same condition
foreach later_strand = strand + 1 to last strand do
if later_strand is dominated by strand and
later_strand does not post-dominate all successors of strand then
// the later strand can only be reached through the strand being
// processed but it is not necessarily reached if the strand being
// processed is reached. If the later strand is post-dominated by
// the strand being processed then there is no requirement to set
// a squash (one may have been set by an even earlier strand).
add later_strand to cond_squash_set taking account of any invert
condition on the strand
endif
endfor
endif

// set the actual squash values to be used in the CDFG
modify squash instruction to accommodate force_squash_set and
cond_squash_set - add additional squash instructions if required



endfor 
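
The unconditional squash set for an entry strand depends only on reachability, which the following Python sketch illustrates. Strands are assumed to be numbered 0 to `num_strands - 1` and the control flow between them is given as a successor dictionary; both are representations chosen for the example rather than taken from the CDFG.

```python
def force_squash_set(strand, num_strands, succs):
    """Later strands that an entry strand can never reach.

    succs: dict strand -> iterable of successor strands.
    """
    reachable, stack = set(), [strand]
    while stack:  # depth-first walk over strand-level control flow
        s = stack.pop()
        for t in succs.get(s, ()):
            if t not in reachable:
                reachable.add(t)
                stack.append(t)
    return {t for t in range(strand + 1, num_strands) if t not in reachable}
```

With `succs = {0: [1, 2], 1: [3], 2: [3]}` (strand 1 as the THEN clause, strand 2 as the ELSE clause) an entry at strand 1 gives `{2, 4}` for five strands: the ELSE clause and the unreachable strand 4 are both squashed.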
9.2.6 New Strand Multiplier Calculation

unsigned CalcNewStrandMultiplier (CDFG, BlockStartAddr, CFLT)

// Calculates the new strand multiplier for a new block. The strand
// multiplier is used to weight the number of instructions in the block
// in order to calculate the execution density. The execution density
// is then used to determine whether the block should be included in the
// current region or not.

// BlockStartAddr is the start address of the current block 

// CFLT is the Control Flow and Liveness Table for the function 

if there are no current strands then
// the first strand in the region has a multiplier of 1
strand_mult = 1
else if there are no inflows to the strand from the CFLT then
// if the new strand has not been initiated because of a new basic block
// (perhaps it was ended by a call or store) then we keep the same
// strand multiplier as the previous strand
strand_mult = strand multiplier from the previous strand
else if there is only one inflow to the strand from the CFLT and
the previous strand ends with a conditional branch then
// If there is only one inflow to a strand and it is via a conditional
// branch then we make the multiplier half that of the previous strand on
// the basis of a 50/50 chance of the branch being taken.
strand_mult = half the strand multiplier from the inflow strand

else 

// If there are multiple inflows to the strand then we add the
// multipliers of the preceding strands together. Thus, for instance,
// if this is the confluence point after an IF-THEN-ELSE construct then
// the two clauses will have a multiplier of 0.5 - after confluence the
// full multiplier of 1 is restored.
strand_mult = 0
foreach inflow to the strand that is internal using the CFLT do
strand_mult = strand_mult + multiplier of predecessor strand
endfor
endif

return strand_mult
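
The four cases of the multiplier calculation can be written out as a small Python function. The parameter names are hypothetical, the CDFG and CFLT lookups are replaced by plain arguments, and floating point values are used because the comments above rely on fractional multipliers such as 0.5.

```python
def calc_new_strand_multiplier(prev_mult, inflow_mults, prev_ends_with_cond_branch):
    """Strand multiplier for a new block.

    prev_mult: multiplier of the previous strand, or None if the region
               has no strands yet.
    inflow_mults: multipliers of the internal strands with an inflow edge
                  to the new strand.
    prev_ends_with_cond_branch: True if the previous strand ends with a
                  conditional branch.
    """
    if prev_mult is None:
        return 1.0                      # first strand in the region
    if not inflow_mults:
        return prev_mult                # ended by a call or store
    if len(inflow_mults) == 1 and prev_ends_with_cond_branch:
        return inflow_mults[0] / 2      # 50/50 chance of being executed
    return sum(inflow_mults)            # confluence of several inflows
```

For an IF-THEN-ELSE, each clause receives 0.5 and the join block, with inflows from both clauses, is restored to 0.5 + 0.5 = 1.0.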



9.3 Code Block Handling 
9.3.1 Advancement To Code 

bool AdvanceToCode (CurAddr, EndAddr) 

// Advances the instruction pointer to the next block of executable code. 

// Returns false if the end of the executable is reached. 

// CurAddr is the current address that is advanced to point to code 

// EndAddr is the end address of the executable 

while CurAddr points to data and CurAddr < EndAddr do 

CurAddr = CurAddr + 1 
endwhile 

return CurAddr < EndAddr 
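
A direct Python rendering of AdvanceToCode might look like the sketch below. Python has no by-reference parameters, so the advanced address is returned alongside the flag; the `is_data` predicate is a hypothetical stand-in for the translator's knowledge of which addresses hold data, and the bounds check is placed first so the predicate is never queried at `end_addr`.

```python
def advance_to_code(cur_addr, end_addr, is_data):
    """Skip over data words; return (new address, True if code was found)."""
    while cur_addr < end_addr and is_data(cur_addr):
        cur_addr += 1
    return cur_addr, cur_addr < end_addr
```

With data at addresses 2 and 3, `advance_to_code(2, 6, {2, 3}.__contains__)` returns `(4, True)`; if only data remains before `end_addr` the flag is `False`.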



9.3.2 Block End Determination 

// possible reasons for ending a block translation 
enum BlockEndType {
EndRegion, // the end of the region is reached
ExcessInstrs, // there is a limit of instructions in a block used to
// break up long blocks in short functions into
// different strands so they can be run in parallel for
// achieving better code density
InflowEdge, // if an inflow edge is detected then this ends
// the basic block
OutflowEdge, // any branch or return instruction ends the block as
// this delimits the basic block
StoreInstr, // a store for which there are potentially aliased
// loads in the same basic block - breaking them
// into separate strands gives the opportunity to
// speculate the load before the store
CallSite // a call (direct or indirect) ends a block as a new
// strand is required after the return
}

BlockEndType FindBlockEnd (CDFG, CurAddr, MaxBlockInstrs, CFLT)

// Finds the end of a block of instructions to be translated into the CDFG. A
// block continues until the basic block is ended, the excess instructions
// limit is reached, a call is reached or certain store instructions are
// reached. The extent of the block is determined prior to performing a
// translation of it. This allows a decision to be made whether the block
// should be included in the current region or whether a new region should
// be started. Returns the reason for ending the block.

// CurAddr is the start address to search the block from - on return this
// points to the instruction after the end of the block
// RegDefs is a table of the current definers for all the architectural
// registers
// MaxBlockInstrs is the maximum number of original instructions that are
// permitted within a block - this may be limited to break up basic blocks
// in short functions to improve code density
// CFLT is the Control Flow and Liveness Table for the function

end_reason = EndRegion
instrs_in_block = 0
do
    // the instruction was successfully translated
    if instruction at CurAddr is a call (direct or indirect) then
        // we have encountered a call so we end the block as the following
        // code will be in a new strand
        end_reason = CallSite
    else if instruction at CurAddr writes to PC then
        // we have encountered a return
        end_reason = OutflowEdge
    else if there is an outflow edge for CurAddr in the CFLT then
        // we have encountered a branch - we mark it as being
        // visited so that the destination block can potentially
        // be included in the current region
        end_reason = OutflowEdge
        mark the destination of the edge as visited
    else if instrs_in_block >= MaxBlockInstrs then
        // there is a limit of instructions in a strand
        end_reason = ExcessInstrs
    else if there are any inflow edges for CurAddr in CFLT then
        // advance the address and see if there are any inflow edges to
        // the new address - the basic block is ended if so
        end_reason = InflowEdge
    else if the instruction at CurAddr is a store then
        // If the instruction is a store and there are load instructions
        // following it in the basic block that may be aliased (not
        // definitely or definitely not aliased) then the block is
        // ended. This allows the following load to be in a different
        // strand and thus potentially speculated earlier than the store
        // using chaz guards.
        foreach follow_instr in the same basic block do
            if follow_instr is a load that may be aliased to CurAddr then
                end_reason = StoreInstr
            endif
        endfor
    endif

    // advance the address and instruction count
    CurAddr = CurAddr + 1
    instrs_in_block = instrs_in_block + 1

// continue until we have an end reason, we reach data or get to the end
// of the executable
until end_reason != EndRegion or CurAddr >= EndAddr or
      CurAddr points to data repeat

// the block has been ended so return the reason why
return end_reason
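
The block-end search above can be sketched in Python as follows. This is a minimal illustration rather than the actual implementation: the instruction list, the `kind` strings, the `may_alias_later` flag and the `inflow`/`outflow` address sets are hypothetical stand-ins for the executable image and the CFLT, and the inflow test is applied to the next address, following the comment in the listing.

```python
from enum import Enum, auto


class EndType(Enum):
    END_REGION = auto()
    EXCESS_INSTRS = auto()
    INFLOW_EDGE = auto()
    OUTFLOW_EDGE = auto()
    STORE_INSTR = auto()
    CALL_SITE = auto()


def find_block_end(instrs, cur_addr, max_block_instrs, inflow, outflow):
    """Return (end_reason, next_addr) for the block starting at cur_addr.

    instrs  : list of dicts with a 'kind' key (hypothetical encoding)
    inflow  : set of addresses that have an inflow edge (stand-in for CFLT)
    outflow : set of addresses that have an outflow edge (stand-in for CFLT)
    """
    reason = EndType.END_REGION
    count = 0
    while reason is EndType.END_REGION and cur_addr < len(instrs):
        instr = instrs[cur_addr]
        if instr["kind"] == "call":
            reason = EndType.CALL_SITE
        elif instr["kind"] == "ret" or cur_addr in outflow:
            reason = EndType.OUTFLOW_EDGE
        elif count >= max_block_instrs:
            reason = EndType.EXCESS_INSTRS
        elif (cur_addr + 1) in inflow:
            # assumption: the edge is checked at the address being advanced to
            reason = EndType.INFLOW_EDGE
        elif instr["kind"] == "store" and instr.get("may_alias_later", False):
            reason = EndType.STORE_INSTR
        # advance the address and instruction count, as in the listing
        cur_addr += 1
        count += 1
    return reason, cur_addr
```

As in the pseudocode, the returned address is the instruction after the end of the block, so a caller can decide whether to include the block in the current region before translating it.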



9.3.3 Block Translation

bool TranslateBlock(CDFG, CurAddr, BlockEndAddr, RegDefs, IsEntryStrand,
                    StrandMultiplier, CFLT)

// Translates a block of instructions into the CDFG. Translation continues
// until the block is ended or until the maximum number of strands limit is
// reached. The translation of the block will require the creation of one or
// more strands. Additional strands are required if conditional instructions
// are present. The extent of the block is predetermined by the
// FindBlockEnd function. Returns false if the maximum number of strands
// has been exceeded and the region must be ended.

// CurAddr is the first instruction in the block to be translated. This
// is updated by the routine.
// BlockEndAddr is the first instruction after the end of the block to be
// translated.
// RegDefs is a table of the current definers for all the architectural
// registers
// IsEntryStrand indicates whether the first strand of the block is an
// entry strand
// StrandMultiplier is the multiplier for calculating the execution weight
// of all strands in the block
// CFLT is the Control Flow and Liveness Table for the function



do
    // confluence the register definitions to take account of the inflow
    ConfluenceRegDefs(CDFG, CurAddr, RegDefs, CFLT)

    // a new strand is started - a guard operation and phase barrier
    // are constructed
    StartNewStrand(CDFG, IsEntryStrand, StrandMultiplier, FuncType)

    // update the squash dependency information
    if there are inflow edges at CurAddr then
        UpdateSquashDeps(CDFG, CurAddr, CFLT)
    else
        mark the new strand as having the same dependencies as the
        previous strand in addition to the previous strand itself
    endif

    // any further strands created for the block will not be entry strands
    IsEntryStrand = false

    // translate instructions until the end of the block or until a particular
    // instruction translation requests a new strand
    do
        continue_block = TranslateInstruction(CDFG, CurAddr, RegDefs, CFLT)
    until not continue_block or CurAddr = BlockEndAddr repeat

    // if there are any registers that have been defined in the strand and are
    // live outside it then they need a dependency on the sink node to ensure
    // the write occurs
    SinkLiveDefs(CDFG, CFLT, RegDefs, last instruction address)

    // the register definition information is saved so that it may be
    // retrieved by any subsequent confluence between strands
    save the RegDefs information into the state for cur_strand

    // End the region when the maximum number of strands have been used. A
    // spare strand must always be left to handle a conditional strand for a
    // conditional instruction. Another strand must be left spare for a
    // region in a fast interrupt function as this has a special return
    // strand. We do not perform the check if the instruction translation
    // has requested a new strand as we cannot terminate the region at that
    // point.
    if continue_block then
        max_strands = MAX_STRANDS - 1
        if FuncType = FastInterrupt then
            max_strands = MAX_STRANDS - 2
        endif
        if current number of strands > max_strands then
            return false
        endif
    endif

// repeat until the end of the block unless we run out of strands
until CurAddr = BlockEndAddr repeat

// we successfully finished the block
return true
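
The strand-limit check at the end of the outer loop can be illustrated with a short Python sketch. The value chosen for `MAX_STRANDS` and the string encoding of the function type are assumptions made only for this example; the source does not state them.

```python
MAX_STRANDS = 8  # hypothetical limit, not given in the source


def strand_budget(func_type, max_strands=MAX_STRANDS):
    """Strands usable before the region must end.

    One strand is kept spare for a conditional instruction; a fast
    interrupt function keeps a second one spare for its return strand.
    """
    budget = max_strands - 1
    if func_type == "FastInterrupt":
        budget = max_strands - 2
    return budget


def region_must_end(current_strands, func_type, continue_block,
                    max_strands=MAX_STRANDS):
    """Mirror of the check in TranslateBlock.

    The check is skipped when the instruction translation has requested
    a new strand (continue_block is False), because the region cannot be
    terminated at that point.
    """
    if not continue_block:
        return False
    return current_strands > strand_budget(func_type, max_strands)
```

With the assumed limit of 8, a normal function may use up to 7 strands and a fast interrupt function up to 6 before `TranslateBlock` would return false.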



9.3.4 Live Register Sinking

void SinkLiveDefs(CDFG, CFLT, RegDefs, LastAddr)

// Sinks the live definitions at the end of a strand. Data dependency arcs
// are generated from each register definition in the current strand that
// is live at the end of the strand to a special sink node. This prevents
// the register write being optimised away.

// CFLT is the Control Flow and Liveness Table for the function
// RegDefs is the current set of register definitions
// LastAddr is the last instruction in the strand

live_set = the set of live out registers at LastAddr obtained from CFLT
foreach reg_def in RegDefs do
    if reg_def is in live_set and register was defined in current strand then
        create data flow arc from register definer to sink node for CDFG
    endif
endfor
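
A minimal Python sketch of the sinking step is given below. The representation of `RegDefs` as a dictionary from register name to a list of `(definer_node, strand_id)` pairs, and the use of the string `"sink"` for the sink node, are assumptions made for illustration only.

```python
def sink_live_defs(reg_defs, live_out, current_strand):
    """Return the data flow arcs (definer, sink) to add to the graph.

    reg_defs       : dict register -> list of (definer_node, strand_id)
    live_out       : set of registers live at the end of the strand
    current_strand : identifier of the strand being closed
    """
    arcs = []
    for reg, definers in reg_defs.items():
        if reg not in live_out:
            continue  # a dead register write may be optimised away
        for node, strand in definers:
            if strand == current_strand:
                arcs.append((node, "sink"))
    return arcs
```

Only definitions that are both live out and made in the current strand receive an arc; definitions inherited from earlier strands were already sunk when those strands were closed.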



9.3.5 Register Confluence Handling 

void ConfluenceRegDefs(CDFG, CurAddr, RegDefs, CFLT)

// Updates the register definition information when inflow edges are
// reached. The register definitions are formed from the union of
// the register definitions flowing into all the incoming edges.
// Duplicate entries are removed but some registers might have a list
// of potential definers associated with them (if some register writes
// are in conditional sections of code)

// CurAddr is the instruction which has an inflow edge representing a
// new basic block
// RegDefs is the table of current definers for the architectural
// registers - this is updated to represent the inflow for the new
// basic block
// CFLT is the Control Flow and Liveness Table that is also used to
// hold the register definers at the end of previous basic blocks

delete the current RegDefs
clear the first use for all registers
foreach inflow edge from another strand in the region do
    add the register definitions left at the end of the strand into the
    RegDefs - removing duplicate entries
endfor
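
The confluence is a per-register union of definer lists. A small Python sketch follows; the `saved_defs` dictionary, holding the definitions saved at the end of each strand, is a hypothetical stand-in for the state kept alongside the CFLT.

```python
def confluence_reg_defs(inflow_strands, saved_defs):
    """Merge the register definitions flowing in from several strands.

    inflow_strands : iterable of strand identifiers with an edge to CurAddr
    saved_defs     : dict strand -> dict register -> list of definer nodes
    Returns a fresh dict register -> list of definers without duplicates.
    """
    merged = {}
    for strand in inflow_strands:
        for reg, definers in saved_defs.get(strand, {}).items():
            bucket = merged.setdefault(reg, [])
            for definer in definers:
                if definer not in bucket:
                    bucket.append(definer)
    return merged
```

A register written on only one side of a conditional ends up with more than one potential definer, which corresponds to the list of potential definers mentioned in the comments above.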



9.3.6 New Strand Creation 

void StartNewStrand(CDFG, IsEntryStrand, StrandMultiplier, FuncType)

// Initiates a new strand. A guard operation and a phase barrier are
// generated for the new strand. Dependencies are generated onto
// any preceding squash operation for the strand, any preceding phase
// barriers and any preceding branch.

// IsEntryStrand is true if the strand is an entry point to the region
// StrandMultiplier is a multiplier for the strand for calculating the
// execution weight. Conditional strands are given a lower multiplier on
// the assumption that each conditional branch has a 50% chance of being
// taken.
// FuncType is the type of function being scheduled

create new strand entry in the CDFG
mark the new strand information with the IsEntryStrand flag
mark the new strand with the StrandMultiplier

// Each strand has a guard operation that is dependent on the phase
// barrier (i.e. it must be issued beforehand). This is a conditional node
// that is only issued if there is a weak dependency arc violation. It
// aborts the strand if it is not the lowest numbered strand being executed.
guard_op = GenOp(CDFG, guard unit, guard)
if IsEntryStrand and entry is for true side entry (not call return) then
    mark guard_op so conditional arcs are not connected to it - it must always
    be issued as this is an entry strand
endif

// Commit nodes separate speculative and committed phases. Commit
// nodes are themselves ordered (so the order of strands is maintained).
// Non-speculative instructions such as stores and register writes are
// placed behind the commit.
phase_op = GenOp(CDFG, commit unit, commit)

// the guard must come before the committed phase of the strand
generate control arc between guard_op and phase_op

if there is a current squash instruction then
    if FuncType = Critical then
        // A strong arc is generated between any squash instruction associated
        // with the strand and the phase barrier. This ensures that the
        // committed phase of the strand only starts when its squash status
        // is known.
        create a strong control arc from preceding squash instruction to
        phase_op
    else
        // A weak arc is generated between any squash instruction associated
        // with the strand and the phase barrier node itself. The conditional
        // arc corresponding to the weak arcs is to the guard operation
        // for the strand. Thus if the squashes for the strand are not resolved
        // by its committed phase then the strand is aborted unless it is the
        // lowest strand being executed.
        create a weak control flow arc from the preceding squash instruction
        to phase_op
        create a corresponding conditional arc between the squash instruction
        and guard_op
    endif
endif

if there is a preceding strand then
    if FuncType = Critical then
        // A strong control arc is generated to the phase barrier of any
        // previous strand - this enforces the strand commit order
        create a strong control flow arc from the phase barrier to the
        preceding strand phase_op
    else
        // A weak control arc is generated to the phase barrier of any
        // previous strand - this enforces the strand commit order. The
        // conditional arc is to the current strand guard so that if the
        // weak arc is activated then the guard operation is issued.
        create a weak control flow arc from the phase barrier to the preceding
        strand phase_op
        create a corresponding conditional arc between the preceding phase
        barrier and guard_op
    endif

    if the previous strand has a branch operation then
        if FuncType = Critical then
            // If the previous strand has a branch operation then a strong
            // control arc is generated to it. This ensures that the branch is
            // issued before the current strand commits so that the current
            // strand may be aborted if required.
            create a strong control flow arc from the previous branch to
            phase_op
        else
            // If the previous strand has a branch operation then a weak control
            // arc is generated to it - this ensures that the branch is issued
            // before the current strand commits so that the current strand may
            // be aborted if required. The conditional arc is to the current
            // strand guard.
            create a weak control flow arc from the previous branch to phase_op
            create a corresponding conditional arc from the previous branch to
            guard_op
        endif
    endif
endif



9.4 Code Translation
9.4.1 Instruction Translation

bool TranslateInstruction(CDFG, CurAddr, RegDefs, CFLT, FuncType)

// Translates the current instruction into entries within the CDFG.
// The instruction is transformed into primitives to read operand registers,
// perform the operation on the required execution unit and then write back
// the result. Complex operations (such as load/store multiple on
// ARM) may be transformed into many operations. Instructions
// with conditions, such as conditional branches and conditional
// instructions, are broken down into the conditional evaluation into a
// squash instruction, followed by the actual instruction translation in
// the following strand. In such a case the function requests a new strand
// without updating the current address.

// Returns false if a new strand is required. This may be the case if
// a branch, call or software interrupt is translated. A new strand may
// also be required when translating conditional instructions and the
// condition of the current strand is not that required.

// CDFG is the Control and Data Flow Graph that is updated as required
// CurAddr is the address of the instruction that should be translated -
// this is updated by this routine
// RegDefs is a table of the current definers for all the architectural
// registers - this is updated as required and multiple definers
// may be present if a register update is conditional
// CFLT is the Control Flow and Liveness Table for the region
// FuncType is the type of function being scheduled

// at the start of a new strand we know that the entry conditions will be
// that of the condition of the instruction
if no operations have been issued in the current strand then
    make the current strand condition the same as that of the current
    instruction
endif

// Generate an instruction PC load and predicate. We do not generate a 
// second predicate for the same instruction. If an instruction is 
// conditional then the predicate will be generated before the condition 
// evaluation and squash. The code for the instruction will actually be in 
// the following strand. Thus any breakpoint will occur whether or not the 
// condition is true. However, if there are subsequent instructions with the 
// same condition then they will be in the conditional strand. Thus the 
// breakpoint check will only occur conditionally. This differs from the 

// thrfi,c3:?:Jt^y^''^'''""'- ^^^-^^^^ -^'y 

if last_pred != CurAddr then
    updated_PC = GenPCUpdate(CDFG, CurAddr)
    GenInstrPredicate(CDFG, updated_PC)
    last_pred = CurAddr

    // generate any required entry code if this is the first instruction
    // in a function - the code is generated after the instruction predicate
    // for the first instruction so effectively executes as part of its
    // behaviour
    if this is the first instruction in the function then
        GenEntryCode(CDFG, RegDefs, CurAddr, FuncType)
    endif
endif

// translate the current instruction
if instruction condition is not the same as that of the strand then
    // If we have a conditional instruction and the condition for the strand
    // is not the same then we evaluate the condition - which will generate a
    // squash instruction. We then end the current strand without advancing
    // the instruction address. When we try and translate the same instruction
    // again we will have set the correct condition for the strand.
    GenCondEval(CDFG, condition from instruction)
    return false
else
    // the strand condition is correct so translate the instruction
    if instruction is branch then
        // We translate a branch and move onto the next instruction - however
        // the current strand is ended. If the branch was conditional then the
        // branch will have been produced in the conditional strand.
        GenBranch(CDFG, destination address)
        CurAddr = CurAddr + 1
        return false
    else if instruction is branch and link then
        if instruction is branch to hardware implemented function then
            // we have a software call now implemented as a functional unit - we
            // translate the call but we do not need to end the strand
            TranslateHardwareFunc(CDFG, RegDefs, CurAddr, FuncType)
        else
            // We translate the call and move onto the next instruction - however
            // the current strand is ended. If the call was conditional then the
            // code will have been produced in the conditional strand.
            TranslateCall(CDFG, RegDefs, CurAddr, FuncType)
            CurAddr = CurAddr + 1
            return false
        endif
    else if instruction is software interrupt then
        // software interrupts are translated like calls and they end the
        // strand
        TranslateSoftwareInt(CDFG, RegDefs, CurAddr, FuncType)
        CurAddr = CurAddr + 1
        return false
    else if instruction is data processing then
        TranslateDataProc(CDFG, RegDefs, CurAddr, FuncType)
    else if instruction is multiply then
        TranslateMultiply(CDFG, RegDefs, CurAddr, FuncType)
    else if instruction is memory access then
        TranslateAccessSingle(CDFG, RegDefs, CurAddr, FuncType)
    else if instruction is multiple memory access then
        TranslateAccessMultiple(CDFG, RegDefs, CurAddr, FuncType)
    else if instruction is swap then
        TranslateSwap(CDFG, RegDefs, CurAddr, FuncType)
    else if instruction is CPSR access then
        TranslateCPSRAccess(CDFG, RegDefs, CurAddr, FuncType)
    else
        instruction type is unknown so report an error
    endif

    // we advance to the next instruction and indicate that the instruction
    // was translated successfully
    CurAddr = CurAddr + 1
    return true
endif






9.4.2 Entry Code Generation

void GenEntryCode(CDFG, RegDefs, CurAddr, FuncType)

// Generates special entry code for a function. For the main entry point to
// the program this generates code to set up the interrupt handlers. For
// interrupt handlers themselves this adds the extra code to preserve the
// volatile registers. For the SWI handler this adds extra code to store
// the required registers.

// RegDefs is the current register definitions
// CurAddr is the address of the first instruction in the function
// FuncType is the type of function being scheduled

// if the function is the main entry point then we need to set up the
// system interrupts
if CurAddr = the main entry point then
    // We set up all the fast interrupts. The handler functions are loaded into
    // the instruction buffer and then locked down so that they are always
    // available. This is achieved by calling the final strand in a special
    // mode. This locks down the entry in the instruction buffer and
    // immediately returns. Unlike with normal calls the actual return address
    // is loaded into the link register. This actually requires multiple
    // strands so needs to return multiple times for a new strand.
    generate code to clear instruction buffer and reset base register to 0 and
    disable interrupts
    foreach fast interrupt function do
        // set up ready for the call
        immed_op = GenImmediate(CDFG, direct return value, true)
        GenRegWrite(CDFG, RegDefs, link register, immed_op, main result port,
                    FuncType)
        entry = the address of the fast interrupt entry from the configuration
                file (fixed up address) with the last strand which
                immediately returns

        // perform a branch but use a special method for calling so that it
        // causes a lock down
        GenBranch(CDFG, entry)
    endfor

    // we set up the entry points for the interrupts into the appropriate
    // branch registers
    foreach interrupt (fast and slow) do
        entry = the address of the slow interrupt from the configuration file
                (fixed up address)
        generate a method on the branch unit to load the entry point into the
        interrupt branch area
    endfor

    enable interrupts

else if FuncType = SlowInterrupt or FuncType = FastInterrupt then
    // If the function is an entry to an interrupt then we have to generate
    // special code to save the contents of all the volatile registers - these
    // are not normally saved as part of the entry sequence. Special registers
    // are available for storing the values - nested interrupts are not
    // permitted so they are not overwritten.
    foreach volatile register, stack pointer and condition code registers do
        generate code to load volatile register then store into special storage
        register
    endfor

    // interrupt routines have their own private stack (address is
    // defined in the configuration file) - this is required because the
    // state of the stack cannot be guaranteed at a cycle boundary level in
    // the main code
    generate code to load stack pointer with fixed value from MetaMapper
    command line option

else if FuncType = SWIHandler then
    // If the function is the entry to the SWI handler then we save copies of
    // various registers into other registers. This emulates the behaviour
    // of a real ARM processor that enters SVC mode when entering the SWI
    // handler. The stack pointer and link registers are saved along with
    // the condition codes.
    save a copy of the stack pointer to the r13_svc register
    save a copy of the link register to the r14_svc register
    save a copy of the condition code registers into extra registers
    copy the alternative link register to the main link register
endif



9.4.3 Call Translation

void TranslateCall(CDFG, RegDefs, CurAddr, FuncType)

// Translates a call operation. This is a direct call operation -
// indirect calls are handled elsewhere as writes to the PC register
// with a separate explicit write to the link register. The return
// address is loaded to the link register as an immediate value. The
// load is marked as a fixup node to avoid the function being
// rescheduled for each PC address change. The branch is then
// performed in the same strand (if it is a conditional branch then
// a special strand will have already been created). All architectural
// registers are cleared as subsequent code does not need a dependency
// as the call return restarts the strand.

// RegDefs is the current register definitions
// CurAddr is the address of the call instruction
// FuncType is the type of function being scheduled

// the link register is written at the end of the strand prior to
// the actual branch being performed
immed_op = GenImmediate(CDFG, PC value after BL instruction, true)
mark immed_op as a fixup node
GenRegWrite(CDFG, RegDefs, link register, immed_op, main result port,
            FuncType)

// the actual branch is generated to be dependent on the phase barrier for
// the current strand - it must be issued before the barrier
GenBranch(CDFG, destination PC value)

// Any current register definitions are cleared - this prevents any register
// dependencies being generated to before the call. Although non-volatile
// registers maintain the same value this is not relevant to schedule
// dependencies since the region is restarted after the call. Even return
// registers are not made dependent because the region is restarted.
foreach architectural register do
    clear any definitions for the register
endfor



9.4.4 Hardware Function Translation

void TranslateHardwareFunc(CDFG, RegDefs, CurAddr, FuncType)

// Translates a hardware function call. This is a software function call in
// the code to a function that is noted as being implemented in

// Snif ^MuitSle'r' ^r^-^r^ ^""'^ operands V?iirhardware 

Poi;;ed"^oTr^fe^^:ce%::S;;:t:rr ^'^"""'"^^ '''"'"^ ^° ^^^^'^-^ 
// RegDefs is the current register definitions
// CurAddr is the address of the call instruction
// FuncType is the type of function being scheduled

// obtain the details for the hardware operation to be performed
hardware_unit = the unit to be used as obtained from the configuration file
hardware_method = the method to be used as obtained from the configuration
                  file

// Gather the required parameters to be made available as operands to
// the hardware unit - the input parameters are loaded from
// registers or stack as required.
in_params = empty
foreach param_num parameter to function do
    if param_num is an input parameter and
       is not the this parameter for a C++ function then
        if parameter is in a register then
            // simply read a register parameter
            param_op = GenRegRead(CDFG, RegDefs, param_num register, FuncType)
        else
            // the parameter must be read from the appropriate location on
            // the stack
            immed_op = GenImmediate(CDFG, stack frame offset of parameter, false)
            base_op = GenRegRead(CDFG, RegDefs, stack pointer)
            addr_op = GenBinaryOp(CDFG, addition unit, add, base_op, immed_op)
            param_op = GenUnaryOp(CDFG, memory unit, load, addr_op)
            AddMemoryDependencies(CDFG, param_op, FuncType)
        endif

        // the parameter is added to those passed to the hardware unit
        add param_op (main result port) to in_params
    endif
endfor

// perform the actual hardware operation
hw_op = GenOp(CDFG, hardware_unit, hardware_method, in_params)

// if the operation produces a result then write to the result
// architectural register
if the function has a return parameter then
    GenRegWrite(CDFG, RegDefs, register 0, hw_op, main result port,
                FuncType)
endif

// Store any output parameters from the hardware operation. These are
// passed by reference. The output parameters are loaded to obtain the
// address and then the result port from the hardware unit is read and the
// result stored to the appropriate location. Note that all such results must
// be word sized.

foreach param_num parameter to function do
    if param_num is an output parameter then




        // obtain the parameter address as required from either a register
        // or the stack depending on the parameter location
        if parameter is in a register then
            addr_op = GenRegRead(CDFG, RegDefs, param_num register, FuncType)
        endif
        AddMemoryDependencies(CDFG, addr_op, FuncType)

        // write the result to the memory location
        write_op = GenBinaryOp(CDFG, memory unit, store, addr_op, hw_op
                               (appropriate result port))
        AddMemoryDependencies(CDFG, write_op, FuncType)
    endif
endfor



9.4.5 Software Interrupt Translation

void TranslateSoftwareInt(CDFG, RegDefs, CurAddr, FuncType)

// Translates a software interrupt. This is converted into a call
// to the entry point of the SWI handler at address 8. The return
// address is written into a special link register. This is copied
// into the main link register by special entry code for the SWI
// handler.

// RegDefs is the current register definitions
// CurAddr is the address of the software interrupt instruction
// FuncType is the type of function being scheduled

// write the return address to a special alternative link register
GenRegWrite(CDFG, RegDefs, alternative link register, immed_op,
            main result port, FuncType)

// perform the actual branch to the SWI handler
GenBranch(CDFG, SWI call handler)



9.4.6 Data Processing Instruction Translation

void TranslateDataProc(CDFG, RegDefs, CurAddr, FuncType)

// Translates all the data processing instructions. These are binary
// operations that mostly write back to a destination register. All
// addressing modes are handled and all the necessary flags generated.

// RegDefs is the current register definitions
// CurAddr is the address of the data processing instruction
// FuncType is the type of function being scheduled

unit_type = the type of unit allocated for instruction
method = the method for the instruction
if addressing mode is immediate then
    if instruction is add and S is not set and Rn is PC and
       addressing mode is immediate then
        // If this is an add of an immediate offset to the PC register then
        // this is converted into a direct latch of an immediate address - the
        // operation is marked as a fixup so any change of address does not
        // force re-scheduling of the function
        addr = calculated immediate address using PC + immediate offset
        if addr is not in a data area then
            generate an error
        endif
        op = GenImmediate(CDFG, addr, true)
        mark op as being a fixup instruction
    else
        // if the operation uses a register with an immediate second
        // operand then form both operands and perform the
        // operation
        op1 = GenRegRead(CDFG, RegDefs, Rn operand, FuncType)
        op2 = GenImmediate(CDFG, immediate value, false)
        op = GenBinaryOp(CDFG, unit_type, method, op1, op2)
    endif

else if addressing mode is immediate shift register then
    // the operation shifts the second parameter by a fixed amount -
    // form the required second operand, then the first operand and then
    // perform the required function
    if shift amount is not 0 then
        // form the required second operand - it is formed first to avoid
        // disturbing the first operand
        op2 = GenShiftImmed(CDFG, RegDefs, shift type from Sh field, Rm operand,
                            shift amount, instruction is not arithmetic, FuncType)
    else
        // the shift amount is 0 so this is a register-register operation
        op2 = GenRegRead(CDFG, RegDefs, Rm operand, FuncType)
    endif
    op1 = GenRegRead(CDFG, RegDefs, Rn operand, FuncType)
    op = GenBinaryOp(CDFG, unit_type, method, op1, op2)
else
    // addressing mode must be register shift with a register amount - the
    // second operand is formed, then the first operand is formed and then
    // the operation is performed
    op2 = GenShiftReg(CDFG, RegDefs, shift type from Sh field,
                      Rm operand, Rs operand, instruction is not arithmetic,
                      FuncType)
    op1 = GenRegRead(CDFG, RegDefs, Rn operand, FuncType)
    op = GenBinaryOp(CDFG, unit_type, method, op1, op2)
endif

// all but the test and compare operations write back the result to an
// architectural register
if instruction is not Test or Compare (that do not write
   to destination register) then
    GenRegWrite(CDFG, RegDefs, Rd operand, op, main result port, FuncType)

^/ iL'^UonH^f}^' ^^/^^ ^^l'^ ^""^ writing to the PC register) then 

// l^l ll^tt^ ""t^^ 2^^"^ ^° - obtained by reading 

^€ a li«f "J^^"*" ^-f^^ ^''^ ""^^ performing the operation ^ 

if S flag set (for writing condition codes) and
   destination register is not PC then
    // the N and Z flags are set for all operations
    GenRegWrite(CDFG, RegDefs, register for N flag, op, N result port,
                FuncType)
    GenRegWrite(CDFG, RegDefs, register for Z flag, op, Z result port,
                FuncType)
    if operation was arithmetic then
        // the C and V flags are only set for arithmetic operations
        GenRegWrite(CDFG, RegDefs, register for C flag, op, C result port,
                    FuncType)
        GenRegWrite(CDFG, RegDefs, register for V flag, op, V result port,
                    FuncType)
    endif
endif
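
To make the flag handling concrete, the following Python sketch shows how the four result ports could be derived for a 32 bit addition, and the two ports used for a logical operation. It follows the rule stated in the listing (N and Z for all operations, C and V for arithmetic operations only); the function names and the dictionary used for the ports are assumptions made for this example.

```python
MASK32 = 0xFFFFFFFF


def add_with_flags(a, b):
    """32 bit add returning (result, flags) with N, Z, C and V."""
    full = (a & MASK32) + (b & MASK32)
    result = full & MASK32
    n = (result >> 31) & 1            # sign bit of the result
    z = int(result == 0)              # result is zero
    c = int(full > MASK32)            # unsigned carry out of bit 31
    # signed overflow: both operands differ in sign from the result
    v = int(((a ^ result) & (b ^ result) & 0x80000000) != 0)
    return result, {"N": n, "Z": z, "C": c, "V": v}


def and_with_flags(a, b):
    """32 bit logical AND returning (result, flags) with N and Z only."""
    result = a & b & MASK32
    return result, {"N": (result >> 31) & 1, "Z": int(result == 0)}
```

For instance, adding 1 to 0x7FFFFFFF sets N and V but not C, while adding 1 to 0xFFFFFFFF sets Z and C but not V.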

9.4.7 Multiply Translation

void TranslateMultiply(CDFG, RegDefs, CurAddr, FuncType)

// Translates a multiply instruction. Both 32 bit and 64 bit variants of the
// instruction are supported, with or without accumulate. The implementation
// relies on a 32 bit by 32 bit multiplier producing a 64 bit result

// RegDefs is the current register definitions
// CurAddr is the address of the multiply instruction
// FuncType is the type of function being scheduled

// perform the basic multiply operation - this multiplies two 32 bit
// values to produce a 64 bit result
left_op = GenRegRead(CDFG, RegDefs, Rm operand, FuncType)
right_op = GenRegRead(CDFG, RegDefs, Rs operand, FuncType)
mul_op = GenBinaryOp(CDFG, multiply unit, signed/unsigned multiply depending
                     on instruction, left_op, right_op)

// handle the remainder of the instruction depending on its type
if instruction is MUL then
    // a simple multiply writes back a 32 bit result
    GenRegWrite(CDFG, RegDefs, Rd operand, mul_op, lower result port,
                FuncType)
else if instruction is MLA then
    // if the instruction is a multiply-accumulate then read the
    // accumulator register and add it to the result
    acc_op = GenRegRead(CDFG, RegDefs, Rn operand, FuncType)
    add_op = GenBinaryOp(CDFG, addition unit, addition, mul_op, acc_op)
    GenRegWrite(CDFG, RegDefs, Rd operand, add_op, main result port,
                FuncType)
else if instruction is UMULL or SMULL then
    // if the instruction is a signed or unsigned long multiply then
    // write back the full 64 bit result into two registers
    GenRegWrite(CDFG, RegDefs, RdLo operand, mul_op, lower result port,
                FuncType)
    GenRegWrite(CDFG, RegDefs, RdHi operand, mul_op, upper result port,
                FuncType)
else
    // for a long multiply accumulate we must read and then add to the
    // existing values of the destination registers
    acc_op = GenRegRead(CDFG, RegDefs, RdLo operand, FuncType)
    add_op = GenBinaryOp(CDFG, addition unit, addition, mul_op -
                         lower result port, acc_op)
    GenRegWrite(CDFG, RegDefs, RdLo operand, add_op, main result port,
                FuncType)
    acc_op = GenRegRead(CDFG, RegDefs, RdHi operand, FuncType)
    param_list = mul_op - upper result port
    add acc_op to param_list
    add add_op carry output port to param_list
    add_op = GenOp(CDFG, addition unit, addition with carry, param_list)
    GenRegWrite(CDFG, RegDefs, RdHi operand, add_op, main result port,
                FuncType)
endif

// if the S bit of the instruction is set then the N and Z flags need to
// be updated - N is the most significant bit of the result and Z is set
// if the whole result is 0
if S bit is set then
    bit copy from bit 31 of upper result word to N flag
    set Z flag if result (one or two words) is zero - use OR operation
    for two word case
endif
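
The 64 bit multiply-accumulate path above can be illustrated numerically. The following is a minimal Python sketch, not the translator itself: it models the 32 bit by 32 bit multiplier and the two chained additions (the second consuming the carry of the first) on plain integers. The function names `mul32x32`, `add_with_carry` and `mlal` are hypothetical and chosen only for this illustration.

```python
MASK32 = 0xFFFFFFFF


def mul32x32(a, b, signed):
    """Model a 32x32 multiplier: return (lower, upper) words of the 64 bit product."""
    if signed:
        # Reinterpret the 32 bit patterns as two's complement values.
        a = a - (1 << 32) if a & 0x80000000 else a
        b = b - (1 << 32) if b & 0x80000000 else b
    product = (a * b) & 0xFFFFFFFFFFFFFFFF
    return product & MASK32, product >> 32


def add_with_carry(x, y, carry_in=0):
    """32 bit addition returning (result, carry_out)."""
    total = x + y + carry_in
    return total & MASK32, total >> 32


def mlal(rm, rs, rd_lo, rd_hi, signed, s_bit=False):
    """64 bit multiply-accumulate built from one multiply and two additions."""
    lower, upper = mul32x32(rm, rs, signed)
    new_lo, carry = add_with_carry(lower, rd_lo)         # low word addition
    new_hi, _ = add_with_carry(upper, rd_hi, carry)      # high word uses the carry
    flags = None
    if s_bit:
        # N is bit 31 of the upper word, Z is set when the OR of both words is 0.
        flags = {"N": (new_hi >> 31) & 1, "Z": int((new_lo | new_hi) == 0)}
    return new_lo, new_hi, flags
```

For example, `mlal(0xFFFFFFFF, 2, 2, 0, signed=False)` carries out of the low word, so the high word receives the extra 1.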
9.4.8 Single Memory Access Translation

void TranslateAccessSingle (CDFG, RegDefs, CurAddr, FuncType) 

// Translates a single memory access instruction. This might be for 

// a word, half word or byte. All addressing modes are handled along 

// with write backs to the base register for pre/post increment modes. 

// RegDefs is the current register definitions 

// CurAddr is the address of the CPSR access instruction 

// FuncType is the type of function being scheduled 

// calculate the address to use for the load
if address is pre-indexed then
    // calculate the full address if pre-indexing - and write it back
    // to the base register if required
    addr_op = GenerateAddress(CDFG, RegDefs, CurAddr, FuncType)
    if write back bit is set then
        // write back the updated base register
        GenRegWrite(CDFG, RegDefs, base register Rn, addr_op, main result port,
                    FuncType)
    endif
else
    // the address is post-indexed so just use the base register for the
    // address
    addr_op = GenRegRead(CDFG, RegDefs, Rn operand, FuncType)
endif

// perform the actual memory access - adding more dependencies as required
// to maintain semantically correct memory operation order
if instruction is a store then
    // read the source register for a store - shift to the correct position
    // for a subword write
    reg_op = GenRegRead(CDFG, RegDefs, Rd operand, FuncType)
    if store is for byte or half word then
        reg_op = GenerateOperation(shifter unit, byte shift operation,
                                   reg_op, addr_op)
    endif
    op = GenBinaryOp(CDFG, memory unit, store method, addr_op, reg_op)
else
    // For a load read the location and then shift to the correct byte (and
    // sign extend if required) if a subword access. Write the value back to
    // the required architectural register.
    op = GenUnaryOp(CDFG, memory unit, load method, addr_op)
    if load is for a byte or half word then
        op = GenBinaryOp(CDFG, byte shifter, byte shift and sign extend,
                         op, addr_op)
    endif
    GenRegWrite(CDFG, RegDefs, Rd operand, op, main result port, FuncType)
endif
AddMemoryDependencies(CDFG, op, FuncType)

// if the address is post-indexed then generate the new address and write
// it back to the base register (it is always written back in the
// post-index case)
if address is post-indexed then
    addr_op = GenerateAddress(CDFG, RegDefs, CurAddr, FuncType)
    GenRegWrite(CDFG, RegDefs, base register Rn, addr_op, main result port,
                FuncType)
endif
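
The pre-indexed and post-indexed cases differ only in which value is used for the access and when the base register is updated. A small Python sketch of that rule follows, working on integer values rather than CDFG nodes. The function name `access_single` and its parameters are assumptions made for this illustration.

```python
MASK32 = 0xFFFFFFFF


def access_single(base, offset, pre_indexed, write_back, up=True):
    """Return (access_address, new_base) for a single load or store.

    base and offset are plain integers; up=False models a down (subtracted) offset.
    """
    delta = offset if up else -offset
    indexed = (base + delta) & MASK32
    if pre_indexed:
        # The offset is applied before the access; the base is only
        # updated when the write back bit is set.
        return indexed, (indexed if write_back else base)
    # Post-indexed: the access uses the unmodified base and the base
    # register is always written back afterwards.
    return base, indexed
```

With a base of 0x1000 and an offset of 4, a pre-indexed access uses 0x1004, whereas a post-indexed access uses 0x1000 and leaves 0x1004 in the base register.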



9.4.9 Multiple Memory Access Translation 

void TranslateAccessMultiple(CDFG, RegDefs, CurAddr, FuncType) 
// Translates a multiple load or store instruction. This allows any number 
// of the architectural registers to be stored or loaded into consecutive 
// memory locations. This is generally used to preserve non-volatile 
// registers on the stack. The instruction may also be used to
// perform a function return when the PC register is in the register list.
// RegDefs is the current register definitions
// CurAddr is the address of the CPSR access instruction
// FuncType is the type of the function being scheduled

if register list is not empty then
    // read the base register - offsets are generated from this for the
    // individual accesses
    base_op = GenRegRead(CDFG, RegDefs, Rn operand, FuncType)

    // step through each of the registers in the list
    offset = 0
    for reg_num = 0 to 15 do
        if reg_num is in register list then
            // generate the next immediate offset
            if U bit is set (increment address) then
                new_offset = offset + 4
            else
                new_offset = offset - 4
            endif

            // if this is a pre-index access then update the offset first
            if P bit set (pre-index) then
                offset = new_offset
            endif

            // generate the required address - if the offset is 0 then the
            // base register value is used directly
            if offset is not 0 then
                immed_op = GenImmediate(CDFG, offset, false)
                addr_op = GenBinaryOp(CDFG, addition unit, add method, base_op,
                                      immed_op)
            else
                addr_op = base_op
            endif

            // perform the actual individual load or store operation
            if L bit is set (i.e. load multiple is being performed) then
                mem_op = GenUnaryOp(CDFG, memory unit, load, addr_op)
                GenRegWrite(CDFG, RegDefs, reg_num, mem_op, main result port,
                            FuncType)
            else
                data_op = GenRegRead(CDFG, RegDefs, reg_num, FuncType)
                mem_op = GenBinaryOp(CDFG, memory unit, store, addr_op, data_op)
            endif

            // add the required memory dependencies to maintain semantically
            // correct operation order
            AddMemoryDependencies(CDFG, mem_op, FuncType)

            // the offset is updated
            offset = new_offset
        endif
    endfor

    // write back the base register if required
    if W bit is set (write back base register) then
        immed_op = GenImmediate(CDFG, offset, false)
        addr_op = GenBinaryOp(CDFG, addition unit, add method, base_op,
                              immed_op)
        GenRegWrite(CDFG, RegDefs, Rn operand, addr_op, main result port,
                    FuncType)
    endif
endif
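
The offset bookkeeping in the loop above is easy to get wrong, so a short Python sketch of it may help. It follows the pseudocode as written (it is not taken from the ARM architecture manual) and returns the address used for each register together with the final written-back base. The name `multiple_access_addresses` is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def multiple_access_addresses(base, reg_list, u_bit, p_bit):
    """Return ([(reg_num, address), ...], write_back_value) for a register list.

    u_bit selects increment (True) or decrement (False);
    p_bit selects pre-index (True) or post-index (False).
    """
    offset = 0
    accesses = []
    for reg_num in range(16):
        if reg_num not in reg_list:
            continue
        new_offset = offset + 4 if u_bit else offset - 4
        if p_bit:
            # Pre-index: the offset is advanced before the access.
            offset = new_offset
        accesses.append((reg_num, (base + offset) & MASK32))
        # Post-index (and pre-index alike) continue from the advanced offset.
        offset = new_offset
    return accesses, (base + offset) & MASK32
```

For a base of 0x1000 and registers r0, r1 and r14, an incrementing post-indexed transfer touches 0x1000, 0x1004 and 0x1008, and the write-back value is 0x100C.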



9.4.10 Swap Translation

void TranslateSwap (CDFG, RegDefs, CurAddr, FuncType) 

// Translates swap instructions. These are atomic memory operations used
// to swap two values in memory in order to implement semaphore operations.
// The load and the store must not be separated. Since this is encoded as a
// single instruction it is guaranteed that the CriticalBlue operations will
// be part of the same strand - and thus remain atomic.
// RegDefs is the current register definitions
// CurAddr is the address of the CPSR access instruction
// FuncType is the type of the function being scheduled

// load the present value in the memory location - if it is a 

// byte value then the byte shifter unit is used to extract the individual 

// byte 

base_addr = GenRegRead(CDFG, RegDefs, Rn operand, FuncType)
load_op = GenUnaryOp(CDFG, memory unit, load, base_addr)
AddMemoryDependencies(CDFG, load_op, FuncType)
if load is for a byte or half word then
    load_op = GenBinaryOp(CDFG, byte shifter, byte shift and sign extend,
                          load_op, base_addr)
endif

// Write the new value to the location - if it is a byte value
// then the byte shifter is used to place the byte in the correct
// part of the word prior to the write. The value read by the load
// is written back to the register - after the register read to
// preserve the semantics if the same register is used for both.
st_data = GenRegRead(CDFG, RegDefs, Rm operand, FuncType)
if swap is for a byte then
    st_data = GenBinaryOp(CDFG, byte shifter, byte shift operation,
                          st_data, base_addr)
endif
GenRegWrite(CDFG, RegDefs, Rd operand, load_op, main result port,
            FuncType)
store_op = GenBinaryOp(CDFG, memory unit, store, base_addr, st_data)
AddMemoryDependencies(CDFG, store_op, FuncType)



9.4.11 Status Word Access Translation 

void TranslateCPSRAccess(CDFG, RegDefs, CurAddr, FuncType) 

// Translates Current Program Status Register (CPSR) accesses. Only the
// condition code flag bits are modelled in the CriticalBlue architecture -
// all other bits are ignored. Each of the flag bits is stored in an
// individual flag register in the CriticalBlue architecture. Code is
// emitted to convert this representation to/from the standard bit positions
// in the CPSR.
// RegDefs is the current register definitions
// CurAddr is the address of the CPSR access instruction
// FuncType is the type of the function being scheduled

if operation is CPSR read then 

// The flag bits are actually stored in individual registers - this 
// combines them into a single value equivalent to what would be read 
// from a real CPSR register. A basic set of flags (for other status 
// information) is combined with the results of reads for each of the 
// required flag registers 

    base_op = GenImmediate(CDFG, default value for CPSR - all flags reset,
                           true)
    foreach of the flag registers N, Z, C and V do
        read the appropriate register
        bit copy the flag into the appropriate part of base_op
    endfor
    GenRegWrite(CDFG, RegDefs, Rd operand, last update operation,
                main result port, FuncType)
else if immediate write then
    flag_imm = the immediate value to be loaded into CPSR
    foreach of the flag registers N, Z, C and V do
        write the appropriate flag bit of flag_imm to the individual flag
        register
    endfor
else
    reg_op = GenRegRead(CDFG, RegDefs, Rm operand, FuncType)
    foreach of the flag registers N, Z, C and V do
        read the appropriate flag bit from reg_op
        write it to the individual flag register
    endfor
endif
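
The conversion between individual flag registers and the packed CPSR layout can be sketched in Python as follows. The bit positions (N = 31, Z = 30, C = 29, V = 28) are the standard ARM CPSR positions; the function names `pack_cpsr` and `unpack_cpsr` are hypothetical, and the flags are modelled as a simple dictionary rather than as register units.

```python
# Standard ARM CPSR positions of the condition code flags.
FLAG_BITS = {"N": 31, "Z": 30, "C": 29, "V": 28}


def pack_cpsr(flags, base=0):
    """Combine individual flag values into a CPSR word.

    base supplies the remaining status bits (default: everything reset).
    """
    value = base
    for name, bit in FLAG_BITS.items():
        value &= ~(1 << bit)                  # clear the flag position
        value |= (flags[name] & 1) << bit     # copy in the flag value
    return value & 0xFFFFFFFF


def unpack_cpsr(value):
    """Split a CPSR word back into the individual flag values."""
    return {name: (value >> bit) & 1 for name, bit in FLAG_BITS.items()}
```

A CPSR read corresponds to `pack_cpsr`, and a CPSR write (immediate or register) corresponds to `unpack_cpsr` followed by a write to each flag register.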



9.4.12 Address Generation



Node GenerateAddress(CDFG, RegDefs, InstrAddr, FuncType) 

// Generates the address for a memory access. Handles both
// immediate and register offsets for a single memory access. Accesses to
// data from PC offsets are translated into accesses to an absolute
// location. Returns the address generating operation.
// RegDefs is the current register definitions
// InstrAddr is the address of the instruction being translated
// FuncType is the type of the function being scheduled

if instruction offset is immediate then
    if base register Rn is PC then
        // a PC relative access with an immediate offset is transformed into an
        // access to an absolute address
        addr = absolute address formed from the PC value and the immediate offset
        if addr is not in a data area then
            generate an error
        endif
        addr_op = GenImmediate(CDFG, addr, true)
        mark addr_op as being a fixup node
    else if immediate offset is 0 (handling split field if required) then
        // the offset is 0 so the base register can be used directly
        addr_op = GenRegRead(CDFG, RegDefs, Rn operand, FuncType)
    else
        // we have an immediate offset - this is translated into an explicit
        // addition on the immediate and the base register. The
        // offset is negated if this is a down offset
        immed_op = GenImmediate(CDFG, immediate value negated if U bit reset
                                (handling split field if required), false)
        base_op = GenRegRead(CDFG, RegDefs, Rn operand, FuncType)
        addr_op = GenBinaryOp(CDFG, addition unit, add, base_op, immed_op)
    endif
else if shift value is 0 (i.e. two register ops) or
        half word/signed byte access then

    // we have two register offsets - these are combined with an addition
    // or subtraction as required
    op1 = GenRegRead(CDFG, RegDefs, Rn operand, FuncType)
    op2 = GenRegRead(CDFG, RegDefs, Rm operand, FuncType)
    if U bit is reset (i.e. subtraction) then
        addr_op = GenBinaryOp(CDFG, subtraction unit, subtract, op1, op2)
    else
        addr_op = GenBinaryOp(CDFG, addition unit, add, op1, op2)
    endif
else
else 

    // we have a register offset that is shifted by a fixed amount - code
    // for an immediate shift is generated
    op2 = GenShiftImmed(CDFG, RegDefs, shift type from Sh field,
                        Rm operand, shift amount, false, FuncType)
    op1 = GenRegRead(CDFG, RegDefs, Rn operand, FuncType)
    if U bit is reset (i.e. subtraction) then
        addr_op = GenBinaryOp(CDFG, subtraction unit, subtract, op1, op2)
    else
        addr_op = GenBinaryOp(CDFG, addition unit, add, op1, op2)
    endif
endif

return addr_op



9.4.13 Immediate Shift Generation

Node GenShiftImmed(CDFG, RegDefs, ShiftType, SourceReg, ShiftAmount, 

GenFlags, FuncType) 
// Generates a shift operation by a fixed amount in an immediate value.
// Uses the byte and bit shifters back-to-back to accomplish the full
// shift. Writes the carry flag if required. Returns the data producing
// operation.

// RegDefs is the current register definitions 

// ShiftType is the type of shift to be performed 

// SourceReg is the register holding the data to be shifted 

// ShiftAmount is the fixed amount to be shifted 

// GenFlags is true if the carry flag should be written back 

// FuncType is the type of the function being scheduled 

// read the data to be shifted 

data_op = GenRegRead(CDFG, RegDefs, SourceReg, FuncType) 

// if the shift is greater than 7 then a byte shift is required
if ShiftAmount requires a byte shift (greater than shift of 7) then
    amount_op = GenImmediate(CDFG, byte shift amount, false)
    data_op = GenBinaryOp(CDFG, Byte Shift Unit, ShiftType, data_op,
                          amount_op)
endif

// if the shift amount is not a multiple of 8 then a bit shift is required -
// this may require a link to any earlier byte shift in order to produce the
// carry flag
if ShiftAmount requires a bit shift (not multiple of 8) then
    amount_op = GenImmediate(CDFG, bit shift amount, false)
    data_op = GenBinaryOp(CDFG, Bit Shift Unit, ShiftType, data_op, amount_op)
endif

// write the carry flag if required
if GenFlags then
    GenRegWrite(CDFG, RegDefs, Register holding C flag, data_op, carry output
                port, FuncType)
endif

return data_op
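
The decomposition of a fixed shift into a byte shift followed by a bit shift can be shown with a short Python sketch. It assumes, for illustration only, that the byte shifter moves data by whole bytes and the bit shifter by the remaining 0 to 7 bits, and it models only a logical left shift on a 32 bit value. The names `split_shift` and `shift_left_two_stage` are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def split_shift(amount):
    """Split a shift amount into (whole bytes, remaining bits)."""
    return amount // 8, amount % 8


def shift_left_two_stage(value, amount):
    """Logical shift left performed as a byte shift then a bit shift."""
    byte_count, bit_count = split_shift(amount)
    if byte_count:
        # Only needed when the shift is greater than 7.
        value = (value << (8 * byte_count)) & MASK32
    if bit_count:
        # Only needed when the shift is not a multiple of 8.
        value = (value << bit_count) & MASK32
    return value
```

A shift by 13, for instance, becomes a one byte shift followed by a five bit shift, while a shift by 16 needs only the byte shifter.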
9.4.14 Register Shift Generation

Node GenShiftReg(CDFG, RegDefs, ShiftType, SourceReg, ShiftReg, GenFlags,
                 FuncType)

// Generates a shift operation by a variable amount as stored in a register. 
// Uses the byte and bit shifters back-to-back to accomplish the full 
// shift. Writes the carry flag if required. Returns the data producing 
// operation. 

// RegDefs is the current register definitions 

// ShiftType is the type of shift to be performed 

// SourceReg is the register holding the data to be shifted 

// ShiftReg is the register holding the amount to be shifted 

// GenFlags is true if the carry flag should be written back 

// FuncType is the type of the function being scheduled 

// load the shift amount and the data to be shifted
amount_op = GenRegRead(CDFG, RegDefs, ShiftReg, FuncType)
data_op = GenRegRead(CDFG, RegDefs, SourceReg, FuncType)

// perform a byte shift followed by a bit shift - may need additional
// output from byte shift to bit shift for the carry
data_op = GenBinaryOp(CDFG, Byte Shift Unit, ShiftType, data_op,
                      amount_op)
data_op = GenBinaryOp(CDFG, Bit Shift Unit, ShiftType, data_op, amount_op)

// if flags should be generated then they come from the carry output of
// the bit shift operation (this must reproduce the carry out from the
// byte shift if there is no bit shift)
if GenFlags then
    GenRegWrite(CDFG, RegDefs, Register holding C flag, data_op, carry output
                port, FuncType)
endif

return data_op



9.4.15 Condition Evaluation Generation

void GenCondEval (CDFG, RegDefs, ConditionType, FuncType) 

// Evaluates a particular condition and generates a squash operation based 
// on that condition. This is used to handle conditional instructions. The
// condition is evaluated and the body of the instruction is put into a 
// following strand whose execution is controlled by the squash. The 
// condition is determined from a table depending on the ConditionType 
// that loads the required individual architectural flag registers. 
// ConditionType gives the type of condition as a 4 bit ARM format 
// condition field 

// a chain of operations is performed to evaluate the condition from the
// individual condition registers
if condition is not always then
    if condition is never then
        // generate a known false state
        flag_state = GenImmediate(CDFG, 0, true)
    else
        // the first condition from a table is loaded
        first_load = initial operation required for condition
        flag_state = GenRegRead(CDFG, RegDefs, register required for
                                first_load, FuncType)

        // the subsequent conditions are combined as required
        foreach load_req for the ConditionType do
            combine_state = GenRegRead(CDFG, RegDefs, register required for
                                       combination, FuncType)
            flag_state = GenBinaryOp(CDFG, Logical Unit, AND, OR or XOR as
                                     required by state table, flag_state,
                                     combine_state)
            if state table requires an inversion then
                flag_state = GenUnaryOp(CDFG, Logical Unit, NOT, flag_state)
            endif
        endfor
    endif

    // The actual squash operation is generated - it is made the current
    // squash operation for the current strand. This allows subsequent
    // branch operations to determine which squash controls their execution -
    // this is used if a branch is deleted if the destination is internal to
    // the region
    squash_op = GenUnaryOp(CDFG, squash unit, true or inverse depending
                           on required flag state, flag_state)
    mark squash_op as the squash instruction for the current strand - this may
    be carried into subsequent strands
endif
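
The table-driven evaluation can be made concrete with a Python sketch. The table below is an assumption constructed for this illustration (the source does not list its state table); it expresses each 4 bit ARM condition as a first flag, a chain of AND/OR/XOR combinations, and a final inversion, which mirrors the "true or inverse" choice made by the squash operation. The condition meanings themselves (EQ, NE, ... LE) are the standard ARM ones.

```python
# cond -> (first flag, [(operation, flag), ...], invert final result)
CONDITION_TABLE = {
    0x0: ("Z", [], False),                            # EQ
    0x1: ("Z", [], True),                             # NE
    0x2: ("C", [], False),                            # CS
    0x3: ("C", [], True),                             # CC
    0x4: ("N", [], False),                            # MI
    0x5: ("N", [], True),                             # PL
    0x6: ("V", [], False),                            # VS
    0x7: ("V", [], True),                             # VC
    0x8: ("Z", [("XOR", "C"), ("AND", "C")], False),  # HI: C and not Z
    0x9: ("Z", [("XOR", "C"), ("AND", "C")], True),   # LS
    0xA: ("N", [("XOR", "V")], True),                 # GE: N == V
    0xB: ("N", [("XOR", "V")], False),                # LT: N != V
    0xC: ("N", [("XOR", "V"), ("OR", "Z")], True),    # GT
    0xD: ("N", [("XOR", "V"), ("OR", "Z")], False),   # LE
}

OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}


def evaluate_condition(cond, flags):
    """Return True if an instruction with this condition field executes."""
    if cond == 0xE:      # always
        return True
    if cond == 0xF:      # never
        return False
    first, combos, invert = CONDITION_TABLE[cond]
    state = flags[first]
    for op, name in combos:
        state = OPS[op](state, flags[name])
    return bool(state ^ int(invert))
```

Each table entry needs at most three flag reads and two logical operations, so the generated chain of operations stays short for every condition.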



9.4.16 PC Update Generation

Node GenPCUpdate(CDFG, RegDefs, InstrAddr, FuncType) 

// Updates the value of the PC architectural register for the
// current source instruction. This update is performed for each
// source instruction. For main execution code these updates are
// optimised away. They are maintained for the shadow code to
// keep the PC value up to date and to provide the input to
// the instruction predicate unit (which is able to perform
// breakpointing). The PC value is formed by loading it with an
// immediate value of the original instruction address.
// Returns the operation that produces the updated PC value.
// RegDefs is the current state of the register definers
// InstrAddr is the address of the current instruction
// FuncType is the type of the function being scheduled

// load the PC value as an immediate value
instr_addr = GenImmediate(CDFG, InstrAddr, true)
mark instr_addr as requiring a fixup

// write the current PC value into the PC register
GenRegWrite(CDFG, RegDefs, PC register, instr_addr, main port)

return instr_addr



9.4.17 Instruction Predicate Generation 

void GenInstrPredicate(CDFG, PCUpdateNode)

// Instruction predicate nodes are used to guard all original instructions.
// An instruction predicate is generated just before the operations
// associated with a particular original instruction. These
// are always optimised out of the main code but remain present for the
// shadow code. They access the breakpoint comparison unit to compare
// original PC values. This provides a predicate to guard all subsequent
// execution of the strand and an interrupt is generated upon a match
// to invoke the monitor to handle the breakpoint. The instruction
// predicate is dependent upon all operations in the previous
// instruction of the strand. The predicate instruction is a
// dependee of all operations for the following original instruction.
// PCUpdateNode is the node generating the updated PC value for the
// instruction about to be translated

pred_op = GenUnaryOp(CDFG, instruction predicate unit, PCUpdateNode) 
foreach prev_node in the CDFG do 

    if prev_node != pred_op and prev_node is in the current strand then




        create a strong control arc from prev_node to pred_op
    endif
endfor

add pred_op as the current instruction predicate for the CDFG (this
is then added as a predecessor for subsequent instructions)



9.4.18 Register Read Generation

Node GenRegRead(CDFG, RegDefs, SourceReg, FuncType) 

// Generates a read of an architectural register. The register definition 
// information is used to generate dependencies to the write(s) that
// generate the value in the register. There may be multiple writers if 
// the control flow is confluenced. These dependencies prevent the read 
// operation being issued before the write. Returns the read node. 
// RegDefs is the current state of the register definitions 
// SourceReg is the architectural register to be read 
// FuncType is the type of the function being scheduled 

if SourceReg is the PC register then
    // reading from the PC register is a special case - we convert it into
    // an immediate load that is fixed up in a later pass to give the correct
    // value even if the code is relocated
    read_op = GenImmediate(CDFG, PC value of instruction + 8, true)
    mark the read_op as a fixup node
else
    // read the appropriate register - the register file may be partitioned
    // statically to improve parallelism
    reg_unit = the appropriate register unit depending on SourceReg
    reg_num = the appropriate register number for the selected unit based on
              SourceReg
    read_op = GenOp(CDFG, reg_unit, reg_num + read selection bit, empty)

    // generate any required dependencies to earlier writes of the register
    foreach prev_def that is a previous definition in RegDefs do
        if prev_def is in the same strand or FuncType = Critical then
            // a strong ordering is required
            if there is only one previous definition of the register then
                // if there is only a single reaching definition then it is
                // marked as a tunnel arc for potential later optimisation
                generate tunnel arc from prev_def to read_op
            else
                // a strong arc ensures that the read is not issued before
                // any of the earlier writes
                generate a strong control flow arc from prev_def to read_op
            endif
        else
            // a weak arc is generated that aborts the reading strand if the
            // dependency is violated
            generate a weak control flow arc from prev_def to read_op
            generate a corresponding conditional arc from prev_def to the
            guard operation associated with the current strand
        endif
    endfor



    // generate an information arc to any first use of the register in the
    // strand - this may be used later to rewrite the graph to only read
    // the register once in the strand
    if there is a first use of the register in RegDefs then
        generate an information arc from the first use to read_op
    else
        make read_op the first use of the register in RegDefs
    endif
endif



return read_op 
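
The choice of arc type for each previous definition can be summarised in a few lines of Python. This is a sketch of the decision rule only; previous definitions are modelled as `(node_name, strand)` tuples and arcs as `(kind, source, target)` tuples, both of which are representations assumed for this illustration.

```python
def read_dependency_arcs(prev_defs, current_strand, critical):
    """Return the arcs generated for a register read.

    prev_defs is a list of (node_name, strand) for the reaching writes.
    """
    arcs = []
    for name, strand in prev_defs:
        if strand == current_strand or critical:
            # A single reaching definition becomes a tunnel arc so that it
            # can be optimised later; otherwise a strong arc is needed.
            kind = "tunnel" if len(prev_defs) == 1 else "strong"
            arcs.append((kind, name, "read"))
        else:
            # A write in another strand only needs a weak arc, paired with
            # a conditional arc to the guard of the current strand.
            arcs.append(("weak", name, "read"))
            arcs.append(("conditional", name, "guard"))
    return arcs
```

A read with one reaching write in the same strand therefore yields a single tunnel arc, while a write from an earlier strand in a non-critical function yields a weak arc plus a conditional arc.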
9.4.19 Register Write Generation

void GenRegWrite(CDFG, RegDefs, DestReg, DataSource, DataPort, FuncType)

// Generates a write to an architectural register. This generates the write
// operation and also updates the register definition information. This is
// used to keep a record of which nodes have updated the architectural
// registers in order to maintain dependencies between them. Control arcs
// are generated between writes to the same register in the original program
// order within a strand. This ensures that live out register values are
// preserved by keeping the same order of writes to architectural registers.
// Arcs to writes in preceding strands are made weak unless the CDFG being
// constructed is critical. A conditional arc is constructed to the guard
// operation for the strand - so that if the register write order is
// violated then previous strands must commit before committing the current
// one. This function also generates code for return, indirect function
// calls and switch branches. These are instructions that modify the PC
// value.
// RegDefs is the current state of the register definitions
// DestReg is the architectural register that is to be written
// DataSource is the node that generates the data to be written
// DataPort is the port of the DataSource where the data is available
// FuncType is the type of the function being scheduled

// perform the actual register write - the architectural registers may be 
// partitioned statically between a number of individual register units 
reg_unit = the appropriate register unit depending on DestReg 
reg_num = the appropriate register number for the selected unit based on 

DestReg 

param = build parameter with DataSource and DataPort 

write_op = GenOp(CDFG, reg_unit, reg_num + write selection bit, param)

// if a register is written then the first use information is cleared 
clear the first use entry associated with the register 

// generate the required dependency arcs to previous writes of the same 
// architectural register to maintain the write order 

if there is a previous list of definitions for the register in RegDefs then
    foreach prev_def that is a previous definition do
        if prev_def is in the same strand or FuncType = Critical then
            // strong write order is maintained in the same strand or in
            // critical functions
            generate a strong control flow arc from prev_def to write_op
        else
            // we generate a weak arc to the previous write and a
            // conditional arc to the guard for the current strand - thus
            // if the write order is violated then the current strand is
            // aborted unless it is the first being executed
            generate a weak control flow arc from prev_def to write_op
            generate a corresponding conditional arc from prev_def to the
            guard operation associated with the current strand
        endif
    endfor
endif

// any write to the PC register is actually a branch in disguise - these are 
// used to implement indirect calls, returns and branches for switch code 
if DestReg is the PC register then 

// if this is a special function type then there may be particular 

// register restores required for a return 

    if instruction is not marked as an indirect call then
        if FuncType = SlowInterrupt or FuncType = FastInterrupt then
            // if this is an interrupt function then we need to generate the
            // code to reload the volatile registers
            foreach volatile register, stack pointer and condition code reg do
                generate code to reload register into main register file
            endfor
        else if FuncType = SWIHandler then
            // if this is a return from a SWI handler then we need to restore
            // the appropriate registers
            restore the condition code registers
            restore the link register from r14_svc
            restore the stack pointer from r13_svc
        endif
    endif

    // The new PC value will point to an indirect link that will actually hold
    // the address to be returned to. We generate an operation to load it. In
    // the case of a return the link is stored just after the call site. In
    // the case of an indirect call it is in the first word of the function to
    // be called. In the case of a branch for switch code each possible
    // destination has a link.
    load_op = GenOp(CDFG, memory unit, load, param)
    GenBranch(CDFG, load_op)
endif

// the register definitions need to be updated with the new write
overwrite the current value of the register in RegDefs with the write_op

// if we have modified any of the condition code registers responsible for
// the current strand condition then we need to clear the current
// condition - this handles conditional operations that actually update the
// condition codes themselves so a subsequent instruction with the same
// condition guard needs a new strand
if DestReg is one of the condition code registers and the condition code
        is relevant to the entry condition of the strand then
    clear the entry condition of the strand
endif



9.4.20 Immediate Value Generation

Node GenImmediate(CDFG, ImmedValue, StrandReq)

// Generates an operation to load the given immediate value. All immediates
// are loaded in a single operation. A number of immediate units are
// made available. The narrowest possible immediate unit is chosen.
// The operation node in the CDFG is returned.
// StrandReq is true if an RSN is required as part of the immediate field -
// this is the case if the immediate feeds into the first operand of its
// consumer

foreach immed_unit from narrowest to widest do
    if ImmedValue can be represented in the width of immed_unit and
            RSN field is present if StrandReq is asserted then
        // generate an operation to load the immediate value - the method
        // port specifies the value for the immediate unit
        imm_op = GenOp(CDFG, immed_unit, ImmedValue,
                       empty (no other parameters))
        return imm_op
    endif
endfor

// all immediate values can be represented - other than those using an ARM
// shift into the most significant bits but those are handled at a higher
// level
report an error
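
The narrowest-unit search can be sketched in Python as below. The set of immediate units, their widths and whether each carries an RSN field are not given in the text, so the `IMMEDIATE_UNITS` table is purely hypothetical, as is the function name `select_immediate_unit`; only unsigned values are considered.

```python
# Hypothetical immediate units: (name, width in bits, has RSN field).
IMMEDIATE_UNITS = [
    ("imm8", 8, False),
    ("imm16", 16, True),
    ("imm32", 32, True),
]


def select_immediate_unit(value, strand_req, units=IMMEDIATE_UNITS):
    """Return the name of the narrowest unit able to hold value.

    When strand_req is True only units with an RSN field qualify.
    """
    for name, width, has_rsn in sorted(units, key=lambda unit: unit[1]):
        fits = 0 <= value < (1 << width)
        if fits and (has_rsn or not strand_req):
            return name
    # Mirrors the final "report an error" step of the pseudocode.
    raise ValueError("immediate value cannot be represented")
```

Under these assumed units a small value normally lands in the 8 bit unit, but is pushed to a wider unit when an RSN field is required.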



9.4.21 Branch Generation

void GenBranch (CDFG, DestPCValue) 
// Generates a branch operation with the given original destination PC
// value. The branch may implement a call, original branch in the code
// or a call to the handler for a software interrupt. The squash node
// that controls the execution of the current strand is stored so that
// if the branch is later deleted (because it turns out the destination
// is inside the region) then the squash operation can be updated
// appropriately
// DestPCValue is the destination of the branch in terms of the original
// PC value

// Generate the destination address - this is a full destination
// including strand number and address. It is handled as a fixup
// node so that it is calculated after the binary has been generated to
// allow differing values without forcing re-scheduling of a function.
fixup_node = GenOp(CDFG, Wide Immediate Unit, DestPCValue)
mark fixup_node as being a fixup node

// the branch operation itself is generated
branch_node = GenUnaryOp(CDFG, branch unit, initiate branch, fixup_node)
store information about the squash node that controls the execution of
the current strand



9.4.22 Unary Operation Generation

Node GenUnaryOp(CDFG, UnitType, UnitMethod, ParamNode)

// Generates an operation requiring a single parameter in addition to
// the method
// UnitType is the type of unit required
// UnitMethod is the method to be executed on the unit
// ParamNode is the operation which is the source of the parameter - the
// main result port is used

param_list = build parameter from ParamNode main result port
return GenOp(CDFG, UnitType, UnitMethod, param_list)



9.4.23 Binary Operation Generation

Node GenBinaryOp(CDFG, UnitType, UnitMethod, LeftNode, RightNode)

// Generates an operation requiring two parameters in addition to
// the method
// UnitType is the type of unit required
// UnitMethod is the method to be executed on the unit
// LeftNode is the operation which is the source of the first parameter -
// the main result port is used
// RightNode is the operation which is the source of the second parameter -
// the main result port is used

param_list = build parameter from LeftNode main result port
add parameter built from RightNode main result port to param_list
return GenOp(CDFG, UnitType, UnitMethod, param_list)



9.4.24 Operation Generation

Node GenOp(CDFG, UnitType, UnitMethod, ParamList)

// Creates a new operation and adds it to the CDFG. A list
// of parameters is specified (that may be empty) which show
// the operands of the operation.
// UnitType is the type of unit required
// UnitMethod is the method to be executed on the unit
// ParamList is the list of required parameters - each has a
// source node and the port from the node on which the value
// is available

new_node = create a node of the required UnitType with UnitMethod
           in the CDFG

if UnitMethod cannot be issued speculatively then
    // If the operation cannot be issued speculatively then the
    // phase barrier for the strand is made to depend on it to ensure
    // it is only issued during the commit phase. The arc indicates
    // the number of clock cycles to resolve a commit.
    create control flow arc from current strand phase barrier to
        new_node with a latency for commit resolution
endif

// all emitted operations depend on the most recent instruction
// predicate
create control flow arc from the instruction predicate for the
    current instruction to new_node

// create the data inflows to the operation as data arcs
foreach operand in ParamList do
    create a data arc from the source node and port in ParamList - the arc
        is labelled with the latency of results from the unit (also add
        the relevant transport costs - from transport delay in node)
endfor

// Each method may be a member of a number of ordering sets. All methods
// within each set must maintain the same ordering as in the original
// program. This allows dependencies between operations to be taken into
// account. Control arcs are generated to ensure the same ordering is kept.
// Virtual registers in RegDefs (not actually shown as accessible here)
// are used to keep the ordering information. The confluence operations
// performed on RegDefs also confluence the ordering sets.
foreach order_set that the UnitMethod is a member of do
    generate a control arc to the last instruction issued in set (actually
        use virtual registers in RegDefs that need to be made accessible to
        this routine)
    add new_node to the appropriate virtual register in RegDefs
endfor

return new_node
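
The GenOp steps above can be sketched in Python as follows. This is only an illustrative model, not the specification's implementation: the `Node`, `Arc` and `CDFG` classes, the `speculative` and `order_sets` arguments, the latency parameters and the `last_in_set` dictionary (standing in for the RegDefs virtual registers) are all assumptions introduced for the example. Transport costs are omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    unit_type: str
    unit_method: str


@dataclass(eq=False)
class Arc:
    kind: str                   # "control" or "data"
    src: Node
    dst: Node
    latency: int = 0
    port: Optional[str] = None  # source port, used by data arcs only


@dataclass(eq=False)
class CDFG:
    phase_barrier: Node         # phase barrier of the current strand
    predicate: Node             # predicate of the current instruction
    nodes: list = field(default_factory=list)
    arcs: list = field(default_factory=list)
    # Hypothetical stand-in for RegDefs: ordering set name -> last node.
    last_in_set: dict = field(default_factory=dict)


def gen_op(cdfg, unit_type, unit_method, params, speculative=True,
           order_sets=(), result_latency=1, commit_latency=1):
    """Create a node and wire its control and data arcs (illustrative)."""
    new_node = Node(unit_type, unit_method)
    cdfg.nodes.append(new_node)

    # Non-speculative operations are tied to the strand's phase barrier.
    if not speculative:
        cdfg.arcs.append(
            Arc("control", cdfg.phase_barrier, new_node, commit_latency))

    # Every emitted operation depends on the current instruction predicate.
    cdfg.arcs.append(Arc("control", cdfg.predicate, new_node))

    # One data arc per operand, labelled with the result latency.
    for src, port in params:
        cdfg.arcs.append(Arc("data", src, new_node, result_latency, port))

    # Preserve the original program order within each ordering set.
    for name in order_sets:
        last = cdfg.last_in_set.get(name)
        if last is not None:
            cdfg.arcs.append(Arc("control", last, new_node))
        cdfg.last_in_set[name] = new_node

    return new_node
```

In this sketch, two stores issued in the same ordering set end up linked by a control arc, which is the ordering behaviour the final loop of GenOp describes.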



9.4.25 Memory Dependency Addition 

void AddMemoryDependencies (CDFG, NewNode, FuncType) 

// Adds memory dependencies in the CDFG for the NewNode. Looks at all 
// previous store nodes that are in strands that are reachable from the 
// current one (or are in earlier parts of the same strand). Generates 
// dependency arcs as required. Also generates chaz operation nodes 
// as appropriate to allow loads to migrate earlier than potentially
// aliased stores. Memory dependencies on call instructions are not
// required as operations cannot be migrated earlier than a call as
// the return causes a region restart. 

// NewNode is the load or store node for which dependencies to earlier 
// memory accesses need to be added 

// FuncType is the type of the function being scheduled

foreach prev_node in the CDFG do
    if prev_node is in a reachable strand from current one in CDFG then
        if prev_node is a store then
            // the node is a store and is reachable from the current strand so
            // we determine if they might be aliased
            dep_req = none
            alias = AliasCheck (prev_node, NewNode)

            if alias = PartialAlias or alias = FullAlias then
                // there is a definite alias
                if prev_node is in the current strand or FuncType = Critical or
                   NewNode is a store then
                    // dependencies in the current strand, dependencies in a
                    // critical function and store to store dependencies must be
                    // strongly ordered
                    dep_req = Strong
                else
                    // if the store-load are issued out of order then we must
                    // abort the strand holding the load
                    dep_req = CondAbort
                endif
            else if alias = MaybeAlias then
                // the accesses may be aliased
                if prev_node is in the current strand or NewNode is a store then
                    // store-store dependencies or dependencies within the same
                    // strand are strongly ordered even if the alias is only
                    // potential
                    dep_req = Strong
                else
                    // if the store-load are issued out of order then we must issue
                    // a check hazard operation to abort the load strand if there
                    // was an actual alias
                    dep_req = CondChaz
                endif
            endif

            // generate the dependency as required
            if dep_req = Strong then
                // we have a strong alias - if a load is definitely loading
                // data from an earlier store then a tunnel arc is generated that
                // may be used by the CDFG optimiser to eliminate the load
                // operation
                if alias = FullAlias and NewNode is a load then
                    create a tunnel arc between prev_node and NewNode
                else
                    create a control arc between prev_node and NewNode
                endif
            else if dep_req = CondAbort then
                // we generate a weak arc that aborts the load strand if it is
                // violated
                create a weak control arc between prev_node and NewNode
                create a corresponding conditional arc between prev_node and the
                    guard operation for the current strand
            else if dep_req = CondChaz then
                // we create the chaz operation and put it into the speculative
                // phase of the current strand
                chaz_node = a new chaz operation node
                add the chaz_node to the CDFG and make it dependent on the
                    phase barrier for the current strand (so that it has to be
                    issued before it)

                // the store-load weak dependency arc is created - a conditional
                // arc to the chaz operation is activated if the dependency order
                // is violated
                create a weak control arc between prev_node and NewNode
                create a corresponding conditional arc between prev_node and
                    chaz_node

                // we get the nodes that generate the store and load addresses and
                // route them to the chaz operation as well - the load address
                // is made conditional since it is an acyclic arc (the load
                // may be forced into the committed phase of the current strand
                // by instruction predicates but the chaz must be in the
                // speculative phase)
                st_addr_node = the node generating the address for the store
                ld_addr_node = the node generating the address for the load
                create a data flow arc from st_addr_node to the chaz_node
                create a conditional data flow arc from ld_addr_node to the
                    chaz_node (this is created as a conditional arc as it is
                    acyclic)
            endif
        endif
    endif
endfor
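
The decision logic of AddMemoryDependencies can be condensed into a small Python sketch. It covers only the choice of dependency type and of arc kind for a strong dependency, not the creation of arcs or chaz nodes. The function names `classify_dependency` and `strong_arc_kind`, the boolean arguments and the enum spellings are assumptions made for this example.

```python
from enum import Enum


class AliasType(Enum):
    NO_ALIAS = 0
    MAYBE_ALIAS = 1
    PARTIAL_ALIAS = 2
    FULL_ALIAS = 3


class DepReq(Enum):
    NONE = 0
    STRONG = 1
    COND_ABORT = 2
    COND_CHAZ = 3


def classify_dependency(alias, same_strand, new_is_store, critical):
    """Choose the dependency between an earlier store and a new access."""
    if alias in (AliasType.PARTIAL_ALIAS, AliasType.FULL_ALIAS):
        # Definite alias: strongly ordered within a strand, in a critical
        # function, or for store-to-store; otherwise abort on violation.
        if same_strand or critical or new_is_store:
            return DepReq.STRONG
        return DepReq.COND_ABORT
    if alias is AliasType.MAYBE_ALIAS:
        # Potential alias: strong within a strand or for store-to-store;
        # otherwise a check hazard operation decides at run time.
        if same_strand or new_is_store:
            return DepReq.STRONG
        return DepReq.COND_CHAZ
    return DepReq.NONE


def strong_arc_kind(alias, new_is_load):
    """A fully aliased load gets a tunnel arc so it may be eliminated."""
    if alias is AliasType.FULL_ALIAS and new_is_load:
        return "tunnel"
    return "control"
```

Note that, following the pseudocode, the critical-function test applies only to definite aliases: a load in a different strand with a potential alias yields a check hazard dependency whatever the function type.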



9.4.26 Memory Alias Checking

enum AliasType {
    NoAlias,      // the operations are definitely not aliased
    MaybeAlias,   // there is not enough information to tell if the accesses
                  // are aliased
    PartialAlias, // the operations definitely overlap but only partially
    FullAlias}    // there is full overlap between the operations and they
                  // are of the same size

AliasType AliasCheck (PrevNode, CurNode)

// Performs an alias check between two nodes in the CDFG. Each of the nodes
// should be a load or store operation.
// PrevNode is the earlier memory access node
// CurNode is the later memory access node

if PrevNode and CurNode both use addresses calculated from additions and
   PrevNode and CurNode both have immediate offsets for the additions and
   PrevNode and CurNode both use the same base register (generated from
   the same node) then
    // both accesses are immediate offsets from the base address - this
    // will catch most stack accesses and also accesses to the same
    // structure or class
    prev_offset = offset for PrevNode
    cur_offset = offset for CurNode

    if prev_offset and cur_offset are identical as are the access sizes then
        // the accesses are completely aliased
        return FullAlias
    else if prev_offset and cur_offset could overlap given access sizes then
        // the accesses are partially overlapped
        return PartialAlias
    else
        // the accesses are definitely not aliased
        return NoAlias
    endif

else if PrevNode and CurNode both have the same address generator then
    // this handles cases where the address does not come from an addition but
    // they both have exactly the same base register
    if PrevNode and CurNode both have the same size then
        // it is a full alias if the access sizes are also identical
        return FullAlias
    else
        // if the access sizes are not identical then the overlap is only
        // partial
        return PartialAlias
    endif
endif

// we do not have enough information to determine if the accesses are
// aliased or not
return MaybeAlias
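
A minimal Python sketch of the alias check follows, assuming each memory access can be summarised by three hypothetical fields: `base` (an identifier for the node that generates the base address), `offset` (the immediate offset, or `None` when the address is not a base-plus-immediate addition) and `size` (the access size in bytes). The overlap test treats each access as the half-open byte range from `offset` to `offset + size`. That is one reasonable reading of "could overlap given access sizes", not something the pseudocode spells out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AliasType(Enum):
    NO_ALIAS = 0
    MAYBE_ALIAS = 1
    PARTIAL_ALIAS = 2
    FULL_ALIAS = 3


@dataclass(frozen=True)
class Access:
    base: str               # node generating the base address (assumed id)
    offset: Optional[int]   # immediate offset, None if not base + immediate
    size: int               # access size in bytes


def alias_check(prev, cur):
    """Classify the overlap between an earlier and a later memory access."""
    same_base = prev.base == cur.base

    if same_base and prev.offset is not None and cur.offset is not None:
        # Both addresses are immediate offsets from the same base.
        if prev.offset == cur.offset and prev.size == cur.size:
            return AliasType.FULL_ALIAS
        # Half-open byte ranges [offset, offset + size) intersect.
        if (prev.offset < cur.offset + cur.size
                and cur.offset < prev.offset + prev.size):
            return AliasType.PARTIAL_ALIAS
        return AliasType.NO_ALIAS

    if same_base and prev.offset is None and cur.offset is None:
        # Same address generator without an addition.
        if prev.size == cur.size:
            return AliasType.FULL_ALIAS
        return AliasType.PARTIAL_ALIAS

    # Not enough information to decide.
    return AliasType.MAYBE_ALIAS
```

For example, two 4-byte accesses at offsets 8 and 10 from the same base are reported as a partial alias, while offsets 8 and 12 are reported as not aliased.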